Computer Architecture Formulas


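1. CPU time = Instruction count × Clock cycles per instruction × Clock cycle time


2. Amdahl’s Law: $\text{Speedup}_{\text{overall}} = \dfrac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \dfrac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$

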
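3. $\text{Power}_{\text{dynamic}} = 1/2 \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$


4. $\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$
5. Average memory-access time = Hit time + Miss rate × Miss penalty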


6. Availability = Mean Time To Fail /(Mean Time To Fail + Mean Time to Repair) 


7. Die yield = $\text{Wafer yield} \times \left(1 + \dfrac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$


where Wafer yield accounts for wafers that are so bad they need not be tested and $\alpha$ is a fitted parameter
that approximates the number of masking levels critical to die yield (usually $\alpha = 4.0$)


8. Average memory-access time = Hit time + Miss rate × Miss penalty
9. Misses per instruction = Miss rate × Memory access per instruction
10. Cache index size: $2^{\text{index}}$ = Cache size/(Block size × Set associativity)


11. Means—arithmetic (AM), weighted arithmetic (WAM), and geometric (GM): 


$\text{AM} = \dfrac{1}{n}\sum_{i=1}^{n} \text{Time}_i \qquad \text{WAM} = \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i \qquad \text{GM} = \sqrt[n]{\prod_{i=1}^{n} \text{Time}_i}$


where $\text{Time}_i$ is the execution time for the ith program of a total of n in the workload, $\text{Weight}_i$ is the
weighting of the ith program in the workload.





12. Geometric standard deviation = $\exp\left(\sqrt{\dfrac{\sum_{i=1}^{n} \left(\ln(\text{Time}_i) - \ln(\text{Geometric mean})\right)^2}{n}}\right)$


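The formulas above are simple enough to evaluate directly. The following is a minimal Python sketch, not part of the original text: the function and parameter names are illustrative assumptions, units are left to the caller, the cache index helper assumes power-of-two sizes, and the geometric standard deviation is assumed to divide by n.

```python
import math


def cpu_time(instruction_count, cycles_per_instruction, clock_cycle_time):
    """Formula 1: CPU time, in the same time unit as clock_cycle_time."""
    return instruction_count * cycles_per_instruction * clock_cycle_time


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Formula 2: overall speedup when only a fraction of run time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def dynamic_power(capacitive_load, voltage, frequency_switched):
    """Formula 3: dynamic power."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency_switched


def static_power(static_current, voltage):
    """Formula 4: static (leakage) power."""
    return static_current * voltage


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Formulas 5 and 8: average memory-access time."""
    return hit_time + miss_rate * miss_penalty


def availability(mttf, mttr):
    """Formula 6: fraction of time the system is up."""
    return mttf / (mttf + mttr)


def die_yield(wafer_yield, defects_per_unit_area, die_area, alpha=4.0):
    """Formula 7: die yield with the fitted parameter alpha."""
    return wafer_yield * (1.0 + defects_per_unit_area * die_area / alpha) ** (-alpha)


def misses_per_instruction(miss_rate, memory_accesses_per_instruction):
    """Formula 9: misses per instruction."""
    return miss_rate * memory_accesses_per_instruction


def cache_index_bits(cache_size, block_size, set_associativity):
    """Formula 10: index bits, assuming every size is a power of two."""
    number_of_sets = cache_size // (block_size * set_associativity)
    return number_of_sets.bit_length() - 1  # exact log2 for powers of two


def arithmetic_mean(times):
    """Formula 11, AM."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(weights, times):
    """Formula 11, WAM: weights are expected to sum to 1."""
    return sum(w * t for w, t in zip(weights, times))


def geometric_mean(times):
    """Formula 11, GM, computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


def geometric_std_dev(times):
    """Formula 12, assuming the sum of squared log deviations is divided by n."""
    log_gm = math.log(geometric_mean(times))
    variance = sum((math.log(t) - log_gm) ** 2 for t in times) / len(times)
    return math.exp(math.sqrt(variance))
```

As a usage example, `amdahl_speedup(0.4, 10)` evaluates 1 / (0.6 + 0.04), or about 1.56: a tenfold improvement to 40% of the execution time yields well under a twofold overall speedup.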
Rules of Thumb

1. Amdahl/Case Rule: A balanced computer system needs about 1 MB of main memory capacity and 1 
megabit per second of I/O bandwidth per MIPS of CPU performance. 

2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its code.

3. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency. 


4. 2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as a two-way set- 
associative cache of size N/2. 


5. Dependability Rule: Design with no single point of failure. 
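Two of these rules are directly quantitative. The following Python sketch, offered only as an illustration, turns the Amdahl/Case Rule and the Bandwidth Rule into back-of-the-envelope estimates; the function names and the dictionary keys are hypothetical and do not come from the text.

```python
def balanced_system(cpu_mips):
    """Amdahl/Case Rule: about 1 MB of memory and 1 Mbit/s of I/O per MIPS."""
    return {
        "memory_mb": 1.0 * cpu_mips,
        "io_mbit_per_s": 1.0 * cpu_mips,
    }


def minimum_bandwidth_gain(latency_gain):
    """Bandwidth Rule: bandwidth grows by at least the square of the latency gain."""
    return latency_gain ** 2


# A hypothetical 500-MIPS processor and a 3x latency improvement.
estimate = balanced_system(500)
bandwidth_gain = minimum_bandwidth_gain(3)
```

Under these rules, the hypothetical 500-MIPS system would want roughly 500 MB of main memory and 500 Mbit/s of I/O bandwidth, and a threefold latency improvement would be accompanied by at least a ninefold bandwidth improvement.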


In Praise of Computer Architecture: A Quantitative Approach 
Fourth Edition 


"The multiprocessor is here and it can no longer be avoided. As we bid farewell 
to single-core processors and move into the chip multiprocessing age, it is great 
timing for a new edition of Hennessy and Patterson's classic. Few books have had 
as significant an impact on the way their discipline is taught, and the current edi- 
tion will ensure its place at the top for some time to come." 


—Luiz Andre Barroso, Google Inc.


"What do the following have in common: Beatles’ tunes, HP calculators, chocolate
chip cookies, and Computer Architecture? They are all classics that have
stood the test of time." 


—Robert P. Colwell, Intel lead architect 


"Not only does the book provide an authoritative reference on the concepts that 
all computer architects should be familiar with, but it is also a good starting point 
for investigations into emerging areas in the field." 


—Krisztian Flautner, ARM Ltd. 


"The best keeps getting better! This new edition is updated and very relevant to 
the key issues in computer architecture today. Plus, its new exercise paradigm is 
much more useful for both students and instructors." 


—Norman P. Jouppi, HP Labs 


"Computer Architecture builds on fundamentals that yielded the RISC revolution, 
including the enablers for CISC translation. Now, in this new edition, it clearly 
explains and gives insight into the latest microarchitecture techniques needed for 
the new generation of multithreaded multicore processors." 


—Marc Tremblay, Fellow & VP, Chief Architect, Sun Microsystems 


"This is a great textbook on all key accounts: pedagogically superb in exposing 
the ideas and techniques that define the art of computer organization and design, 
stimulating to read, and comprehensive in its coverage of topics. The first edition 
set a standard of excellence and relevance; this latest edition does it again." 


—Milos Ercegovac, UCLA


"They've done it again. Hennessy and Patterson emphatically demonstrate why
they are the doyens of this deep and shifting field. Fallacy: Computer architecture
isn't an essential subject in the information age. Pitfall: You don't need the 4th
edition of Computer Architecture."


—Michael D. Smith, Harvard University 


"Hennessy and Patterson have done it again! The 4th edition is a classic encore 
that has been adapted beautifully to meet the rapidly changing constraints of 
'late-CMOS-era' technology. The detailed case studies of real processor products
are especially educational, and the text reads so smoothly that it is difficult to put 
down. This book is a must-read for students and professionals alike!" 


—Pradip Bose, IBM 


"This latest edition of Computer Architecture is sure to provide students with the 
architectural framework and foundation they need to become influential archi- 
tects of the future." 


—Ravishankar Iyer, Intel Corp.


"As technology has advanced, and design opportunities and constraints have 
changed, so has this book. The 4th edition continues the tradition of presenting 
the latest in innovations with commercial impact, alongside the foundational con- 
cepts: advanced processor and memory system design techniques, multithreading 
and chip multiprocessors, storage systems, virtual machines, and other concepts. 
This book is an excellent resource for anybody interested in learning the architec- 
tural concepts underlying real commercial products." 


—Gurindar Sohi, University of Wisconsin-Madison 

"I am very happy to have my students study computer architecture using this fan- 
tastic book and am a little jealous for not having written it myself." 

—Mateo Valero, UPC, Barcelona

"Hennessy and Patterson continue to evolve their teaching methods with the
changing landscape of computer system design. Students gain unique insight into
the factors influencing the shape of computer architecture design and the potential
research directions in the computer systems field."


—Dan Connors, University of Colorado at Boulder 


"With this revision, Computer Architecture will remain a must-read for all com- 
puter architecture students in the coming decade." 


—Wen-mei Hwu, University of Illinois at Urbana-Champaign 


"The 4th edition of Computer Architecture continues in the tradition of providing 
a relevant and cutting edge approach that appeals to students, researchers, and 
designers of computer systems. The lessons that this new edition teaches will 
continue to be as relevant as ever for its readers." 


—David Brooks, Harvard University 


"With the 4th edition, Hennessy and Patterson have shaped Computer Architec- 
ture back to the lean focus that made the 1st edition an instant classic." 


—Mark D. Hill, University of Wisconsin-Madison 
Foreword


by Fred Weber, President and CEO of MetaRAM, Inc.


I am honored and privileged to write the foreword for the fourth edition of this 
most important book in computer architecture. In the first edition, Gordon Bell, 
my first industry mentor, predicted the book's central position as the definitive 
text for computer architecture and design. He was right. I clearly remember the 
excitement generated by the introduction of this work. Rereading it now, with 
significant extensions added in the three new editions, has been a pleasure all 
over again. No other work in computer architecture—frankly, no other work I 
have read in any field—so quickly and effortlessly takes the reader from igno- 
rance to a breadth and depth of knowledge. 

This book is dense in facts and figures, in rules of thumb and theories, in 
examples and descriptions. It is stuffed with acronyms, technologies, trends, for- 
mulas, illustrations, and tables. And, this is thoroughly appropriate for a work on 
architecture. The architect's role is not that of a scientist or inventor who will 
deeply study a particular phenomenon and create new basic materials or tech- 
niques. Nor is the architect the craftsman who masters the handling of tools to 
craft the finest details. The architect's role is to combine a thorough understand- 
ing of the state of the art of what is possible, a thorough understanding of the his- 
torical and current styles of what is desirable, a sense of design to conceive a 
harmonious total system, and the confidence and energy to marshal this knowl- 
edge and available resources to go out and get something built. To accomplish 
this, the architect needs a tremendous density of information with an in-depth 
understanding of the fundamentals and a quantitative approach to ground his 
thinking. That is exactly what this book delivers. 

As computer architecture has evolved—from a world of mainframes, mini- 
computers, and microprocessors, to a world dominated by microprocessors, and 
now into a world where microprocessors themselves are encompassing all the 
complexity of mainframe computers—Hennessy and Patterson have updated 
their book appropriately. The first edition showcased the IBM 360, DEC VAX, 
and Intel 80x86, each the pinnacle of its class of computer, and helped introduce 
the world to RISC architecture. The later editions focused on the details of the 
80x86 and RISC processors, which had come to dominate the landscape. This lat- 
est edition expands the coverage of threading and multiprocessing, virtualization 
and memory hierarchy, and storage systems, giving the reader context appropriate
to today's most important directions and setting the stage for the next decade
of design. It highlights the AMD Opteron and SUN Niagara as the best examples 
of the x86 and SPARC (RISC) architectures brought into the new world of multi- 
processing and system-on-a-chip architecture, thus grounding the art and science 
in real-world commercial examples. 

The first chapter, in less than 60 pages, introduces the reader to the taxono- 
mies of computer design and the basic concerns of computer architecture, gives 
an overview of the technology trends that drive the industry, and lays out a quan- 
titative approach to using all this information in the art of computer design. The 
next two chapters focus on traditional CPU design and give a strong grounding in 
the possibilities and limits in this core area. The final three chapters build out an 
understanding of system issues with multiprocessing, memory hierarchy, and 
storage. Knowledge of these areas has always been of critical importance to the 
computer architect. In this era of system-on-a-chip designs, it is essential for 
every CPU architect. Finally the appendices provide a great depth of understand- 
ing by working through specific examples in great detail. 

In design it is important to look at both the forest and the trees and to move 
easily between these views. As you work through this book you will find plenty 
of both. The result of great architecture, whether in computer design, building 
design or textbook design, is to take the customer's requirements and desires and 
return a design that causes that customer to say, "Wow, I didn't know that was 
possible." This book succeeds on that measure and will, I hope, give you as much 
pleasure and value as it has me. 
Preface


Why We Wrote This Book 


Through four editions of this book, our goal has been to describe the basic princi- 
ples underlying what will be tomorrow's technological developments. Our 
excitement about the opportunities in computer architecture has not abated, and 
we echo what we said about the field in the first edition: "It is not a dreary science 
of paper machines that will never work. No! It's a discipline of keen intellectual 
interest, requiring the balance of marketplace forces to cost-performance-power, 
leading to glorious failures and some notable successes." 

Our primary objective in writing our first book was to change the way people 
learn and think about computer architecture. We feel this goal is still valid and 
important. The field is changing daily and must be studied with real examples 
and measurements on real computers, rather than simply as a collection of defini- 
tions and designs that will never need to be realized. We offer an enthusiastic 
welcome to anyone who came along with us in the past, as well as to those who 
are joining us now. Either way, we can promise the same quantitative approach 
to, and analysis of, real systems. 

As with earlier versions, we have strived to produce a new edition that will 
continue to be as relevant for professional engineers and architects as it is for 
those involved in advanced computer architecture and design courses. As much 
as its predecessors, this edition aims to demystify computer architecture through 
an emphasis on cost-performance-power trade-offs and good engineering design. 
We believe that the field has continued to mature and move toward the rigorous 
quantitative foundation of long-established scientific and engineering disciplines. 


This Edition 


The fourth edition of Computer Architecture: A Quantitative Approach may be 
the most significant since the first edition. Shortly before we started this revision, 
Intel announced that it was joining IBM and Sun in relying on multiple proces- 
sors or cores per chip for high-performance designs. As the first figure in the 
book documents, after 16 years of doubling performance every 18 months,
single-processor performance improvement has dropped to modest annual
improvements. This fork in the computer architecture road means that for the first time in
history, no one is building a much faster sequential processor. If you want your 
program to run significantly faster, say, to justify the addition of new features, 
you're going to have to parallelize your program. 

Hence, after three editions focused primarily on higher performance by 
exploiting instruction-level parallelism (ILP), an equal focus of this edition is 
thread-level parallelism (TLP) and data-level parallelism (DLP). While earlier 
editions had material on TLP and DLP in big multiprocessor servers, now TLP 
and DLP are relevant for single-chip multicores. This historic shift led us to 
change the order of the chapters: the chapter on multiple processors was the sixth 
chapter in the last edition, but is now the fourth chapter of this edition. 

The changing technology has also motivated us to move some of the content 
from later chapters into the first chapter. Because technologists predict much 
higher hard and soft error rates as the industry moves to semiconductor processes 
with feature sizes 65 nm or smaller, we decided to move the basics of dependabil- 
ity from Chapter 7 in the third edition into Chapter 1. As power has become the 
dominant factor in determining how much you can place on a chip, we also 
beefed up the coverage of power in Chapter 1. Of course, the content and exam- 
ples in all chapters were updated, as we discuss below. 

In addition to technological sea changes that have shifted the contents of this 
edition, we have taken a new approach to the exercises in this edition. It is sur- 
prisingly difficult and time-consuming to create interesting, accurate, and unam- 
biguous exercises that evenly test the material throughout a chapter. Alas, the 
Web has reduced the half-life of exercises to a few months. Rather than working 
out an assignment, a student can search the Web to find answers not long after a 
book is published. Hence, a tremendous amount of hard work quickly becomes 
unusable, and instructors are denied the opportunity to test what students have 
learned. 

To help mitigate this problem, in this edition we are trying two new ideas. 
First, we recruited experts from academia and industry on each topic to write the 
exercises. This means some of the best people in each field are helping us to cre- 
ate interesting ways to explore the key concepts in each chapter and test the 
reader's understanding of that material. Second, each group of exercises is orga- 
nized around a set of case studies. Our hope is that the quantitative example in 
each case study will remain interesting over the years, robust and detailed enough 
to allow instructors the opportunity to easily create their own new exercises, 
should they choose to do so. Key, however, is that each year we will continue to 
release new exercise sets for each of the case studies. These new exercises will 
have critical changes in some parameters so that answers to old exercises will no 
longer apply. 

Another significant change is that we followed the lead of the third edition of 
Computer Organization and Design (COD) by slimming the text to include the 
material that almost all readers will want to see and moving the appendices that 
some will see as optional or as reference material onto a companion CD. There
were many reasons for this change: 


1. Students complained about the size of the book, which had expanded from 
594 pages in the chapters plus 160 pages of appendices in the first edition to 
760 chapter pages plus 223 appendix pages in the second edition and then to 
883 chapter pages plus 209 pages in the paper appendices and 245 pages in 
online appendices. At this rate, the fourth edition would have exceeded 1500 
pages (both on paper and online)! 


2. Similarly, instructors were concerned about having too much material to 
cover in a single course. 


3. As was the case for COD, by including a CD with material moved out of the 
text, readers could have quick access to all the material, regardless of their 
ability to access Elsevier's Web site. Hence, the current edition's appendices 
will always be available to the reader even after future editions appear. 


4. This flexibility allowed us to move review material on pipelining, instruction 
sets, and memory hierarchy from the chapters and into Appendices A, B, and 
C. The advantage to instructors and readers is that they can go over the review 
material much more quickly and then spend more time on the advanced top- 
ics in Chapters 2, 3, and 5. It also allowed us to move the discussion of some 
topics that are important but are not core course topics into appendices on the 
CD. Result: the material is available, but the printed book is shorter. In this 
edition we have 6 chapters, none of which is longer than 80 pages, while in 
the last edition we had 8 chapters, with the longest chapter weighing in at 127 
pages. 

5. This package of a slimmer core print text plus a CD is far less expensive to 
manufacture than the previous editions, allowing our publisher to signifi- 
cantly lower the list price of the book. With this pricing scheme, there is no 
need for a separate international student edition for European readers. 


Yet another major change from the last edition is that we have moved the 
embedded material introduced in the third edition into its own appendix, Appen- 
dix D. We felt that the embedded material didn't always fit with the quantitative 
evaluation of the rest of the material, plus it extended the length of many chapters 
that were already running long. We believe there are also pedagogic advantages 
in having all the embedded information in a single appendix. 

This edition continues the tradition of using real-world examples to demon- 
strate the ideas, and the "Putting It All Together" sections are brand new; in fact, 
some were announced after our book was sent to the printer. The "Putting It All 
Together" sections of this edition include the pipeline organizations and memory 
hierarchies of the Intel Pentium 4 and AMD Opteron; the Sun T1 ("Niagara")
8-processor, 32-thread microprocessor; the latest NetApp Filer; the Internet
Archive cluster; and the IBM Blue Gene/L massively parallel processor. 


Topic Selection and Organization 


As before, we have taken a conservative approach to topic selection, for there are 
many more interesting ideas in the field than can reasonably be covered in a treat- 
ment of basic principles. We have steered away from a comprehensive survey of 
every architecture a reader might encounter. Instead, our presentation focuses on 
core concepts likely to be found in any new machine. The key criterion remains 
that of selecting ideas that have been examined and utilized successfully enough 
to permit their discussion in quantitative terms. 

Our intent has always been to focus on material that is not available in equiva- 
lent form from other sources, so we continue to emphasize advanced content 
wherever possible. Indeed, there are several systems here whose descriptions 
cannot be found in the literature. (Readers interested strictly in a more basic 
introduction to computer architecture should read Computer Organization and 
Design: The Hardware/Software Interface, third edition.) 


An Overview of the Content 


Chapter 1 has been beefed up in this edition. It includes formulas for static 
power, dynamic power, integrated circuit costs, reliability, and availability. We go 
into more depth than prior editions on the use of the geometric mean and the geo- 
metric standard deviation to capture the variability of the mean. Our hope is that 
these topics can be used through the rest of the book. In addition to the classic 
quantitative principles of computer design and performance measurement, the 
benchmark section has been upgraded to use the new SPEC2006 suite. 

Our view is that the instruction set architecture is playing less of a role today 
than in 1990, so we moved this material to Appendix B. It still uses the MIPS64 
architecture. For fans of ISAs, Appendix J covers 10 RISC architectures, the
80x86, the DEC VAX, and the IBM 360/370. 

Chapters 2 and 3 cover the exploitation of instruction-level parallelism in 
high-performance processors, including superscalar execution, branch prediction, 
speculation, dynamic scheduling, and the relevant compiler technology. As men- 
tioned earlier, Appendix A is a review of pipelining in case you need it. Chapter 3 
surveys the limits of ILP. New to this edition is a quantitative evaluation of
multithreading. Chapter 3 also includes a head-to-head comparison of the AMD
Athlon, Intel Pentium 4, Intel Itanium 2, and IBM Power5, each of which has made
separate bets on exploiting ILP and TLP. While the last edition contained a great 
deal on Itanium, we moved much of this material to Appendix G, indicating our 
view that this architecture has not lived up to the early claims. 

Given the switch in the field from exploiting only ILP to an equal focus on 
thread- and data-level parallelism, we moved multiprocessor systems up to Chap- 
ter 4, which focuses on shared-memory architectures. The chapter begins with 
the performance of such an architecture. It then explores symmetric and 
distributed-memory architectures, examining both organizational principles and 
performance. Topics in synchronization and memory consistency models are 
next. The example is the Sun T1 ("Niagara"), a radical design for a commercial
product. It reverted to a single-instruction issue, 6-stage pipeline microarchitec- 
ture. It put 8 of these on a single chip, and each supports 4 threads. Hence, soft- 
ware sees 32 threads on this single, low-power chip. 

As mentioned earlier, Appendix C contains an introductory review of cache 
principles, which is available in case you need it. This shift allows Chapter 5 to 
start with 11 advanced optimizations of caches. The chapter includes a new sec- 
tion on virtual machines, which offers advantages in protection, software man- 
agement, and hardware management. The example is the AMD Opteron, giving 
both its cache hierarchy and the virtual memory scheme for its recently expanded 
64-bit addresses. 

Chapter 6, "Storage Systems," has an expanded discussion of reliability and 
availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely 
found failure statistics of real systems. It continues to provide an introduction to 
queuing theory and I/O performance benchmarks. Rather than go through a series 
of steps to build a hypothetical cluster as in the last edition, we evaluate the cost, 
performance, and reliability of a real cluster: the Internet Archive. The "Putting It 
All Together" example is the NetApp FAS6000 filer, which is based on the AMD 
Opteron microprocessor. 

This brings us to Appendices A through L. As mentioned earlier, Appendices 
A and C are tutorials on basic pipelining and caching concepts. Readers relatively 
new to pipelining should read Appendix A before Chapters 2 and 3, and those 
new to caching should read Appendix C before Chapter 5. 

Appendix B covers principles of ISAs, including MIPS64, and Appendix J 
describes 64-bit versions of Alpha, MIPS, PowerPC, and SPARC and their multi- 
media extensions. It also includes some classic architectures (80x86, VAX, and 
IBM 360/370) and popular embedded instruction sets (ARM, Thumb, SuperH, 
MIPS16, and Mitsubishi M32R). Appendix G is related, in that it covers
architectures and compilers for VLIW ISAs.

Appendix D, updated by Thomas M. Conte, consolidates the embedded mate- 
rial in one place. 

Appendix E, on networks, has been extensively revised by Timothy M. Pink- 
ston and Jose Duato. Appendix F, updated by Krste Asanovic, includes a descrip- 
tion of vector processors. We think these two appendices are some of the best 
material we know of on each topic. 

Appendix H describes parallel processing applications and coherence proto- 
cols for larger-scale, shared-memory multiprocessing. Appendix I, by David 
Goldberg, describes computer arithmetic. 

Appendix K collects the "Historical Perspective and References" from each 
chapter of the third edition into a single appendix. It attempts to give proper 
credit for the ideas in each chapter and a sense of the history surrounding the 
inventions. We like to think of this as presenting the human drama of computer 
design. It also supplies references that the student of architecture may want to 
pursue. If you have time, we recommend reading some of the classic papers in 
the field that are mentioned in these sections. It is both enjoyable and educational 
to hear the ideas directly from the creators. "Historical Perspective" was one of
the most popular sections of prior editions. 

Appendix L (available at textbooks.elsevier.com/0123704901) contains solu- 
tions to the case study exercises in the book. 





Navigating the Text 


There is no single best order in which to approach these chapters and appendices, 
except that all readers should start with Chapter 1. If you don't want to read 
everything, here are some suggested sequences: 


• ILP: Appendix A, Chapters 2 and 3, and Appendices F and G
• Memory Hierarchy: Appendix C and Chapters 5 and 6
• Thread- and Data-Level Parallelism: Chapter 4, Appendix H, and Appendix E
• ISA: Appendices B and J


Appendix D can be read at any time, but it might work best if read after the ISA 
and cache sequences. Appendix I can be read whenever arithmetic moves you. 


Chapter Structure 


The material we have selected has been stretched upon a consistent framework 
that is followed in each chapter. We start by explaining the ideas of a chapter. 
These ideas are followed by a "Crosscutting Issues" section, a feature that shows 
how the ideas covered in one chapter interact with those given in other chapters. 
This is followed by a "Putting It All Together" section that ties these ideas 
together by showing how they are used in a real machine. 

Next in the sequence is "Fallacies and Pitfalls," which lets readers learn from 
the mistakes of others. We show examples of common misunderstandings and 
architectural traps that are difficult to avoid even when you know they are lying in 
wait for you. The "Fallacies and Pitfalls" sections are among the most popular
sections of the book. Each chapter ends with a "Concluding Remarks" section.


Case Studies with Exercises 


Each chapter ends with case studies and accompanying exercises. Authored by 
experts in industry and academia, the case studies explore key chapter concepts 
and verify understanding through increasingly challenging exercises. Instructors 
should find the case studies sufficiently detailed and robust to allow them to cre- 
ate their own additional exercises. 

Brackets for each exercise (<chapter.section>) indicate the text sections of 
primary relevance to completing the exercise. We hope this helps readers to avoid 
exercises for which they haven't read the corresponding section, in addition to 
providing the source for review. Note that we provide solutions to the case study 
exercises in Appendix L. Exercises are rated, to give the reader a sense of the
amount of time required to complete an exercise: 


[10] Less than 5 minutes (to read and understand)
[15] 5-15 minutes for a full answer
[20] 15-20 minutes for a full answer
[25] 1 hour for a full written answer
[30] Short programming project: less than 1 full day of programming
[40] Significant programming project: 2 weeks of elapsed time
[Discussion] Topic for discussion with others


A second set of alternative case study exercises is available for instructors
who register at textbooks.elsevier.com/0123704901. This second set will be
revised every summer, so that early every fall, instructors can download a new set 
of exercises and solutions to accompany the case studies in the book. 





Supplemental Materials 


The accompanying CD contains a variety of resources, including the following: 


Reference appendices—some guest authored by subject experts—covering a 
range of advanced topics 


Historical Perspectives material that explores the development of the key 
ideas presented in each of the chapters in the text 


Search engine for both the main text and the CD-only content 


Additional resources are available at textbooks.elsevier.com/0123704901. The
instructor site (accessible to adopters who register at textbooks.elsevier.com)
includes:





Alternative case study exercises with solutions (updated yearly) 
Instructor slides in PowerPoint 
Figures from the book in JPEG and PPT formats 


The companion site (accessible to all readers) includes: 


Solutions to the case study exercises in the text
Links to related material on the Web
List of errata


New materials and links to other resources available on the Web will be
added on a regular basis.
Helping Improve This Book


Finally, it is possible to make money while reading this book. (Talk about cost- 
performance!) If you read the Acknowledgments that follow, you will see that we 
went to great lengths to correct mistakes. Since a book goes through many print- 
ings, we have the opportunity to make even more corrections. If you uncover any 
remaining resilient bugs, please contact the publisher by electronic mail 
(ca4bugs@mkp.com). The first reader to report an error with a fix that we incor- 
porate in a future printing will be rewarded with a $1.00 bounty. Please check the 
errata sheet on the home page (textbooks.elsevier.com/0123704901) to see if the
bug has already been reported. We process the bugs and send the checks about 
once a year or so, so please be patient. 

We welcome general comments to the text and invite you to send them to a 
separate email address at ca4comments@mkp.com.











Concluding Remarks 


Once again this book is a true co-authorship, with each of us writing half the 
chapters and an equal share of the appendices. We can't imagine how long it 
would have taken without someone else doing half the work, offering inspiration 
when the task seemed hopeless, providing the key insight to explain a difficult 
concept, supplying reviews over the weekend of chapters, and commiserating 
when the weight of our other obligations made it hard to pick up the pen. (These 
obligations have escalated exponentially with the number of editions, as one of us 
was President of Stanford and the other was President of the Association for 
Computing Machinery.) Thus, once again we share equally the blame for what 
you are about to read. 


John Hennessy * David Patterson 
Chapter One Fundamentals of Computer Design


1.1 Introduction


Computer technology has made incredible progress in the roughly 60 years since 
the first general-purpose electronic computer was created. Today, less than $500 
will purchase a personal computer that has more performance, more main mem- 
ory, and more disk storage than a computer bought in 1985 for 1 million dollars. 
This rapid improvement has come both from advances in the technology used to 
build computers and from innovation in computer design. 

Although technological improvements have been fairly steady, progress aris- 
ing from better computer architectures has been much less consistent. During the 
first 25 years of electronic computers, both forces made a major contribution, 
delivering performance improvement of about 25% per year. The late 1970s saw 
the emergence of the microprocessor. The ability of the microprocessor to ride 
the improvements in integrated circuit technology led to a higher rate of improve- 
ment—roughly 35% growth per year in performance. 

This growth rate, combined with the cost advantages of a mass-produced 
microprocessor, led to an increasing fraction of the computer business being 
based on microprocessors. In addition, two significant changes in the computer 
marketplace made it easier than ever before to be commercially successful with a 
new architecture. First, the virtual elimination of assembly language program- 
ming reduced the need for object-code compatibility. Second, the creation of 
standardized, vendor-independent operating systems, such as UNIX and its 
clone, Linux, lowered the cost and risk of bringing out a new architecture. 

These changes made it possible to develop successfully a new set of architec- 
tures with simpler instructions, called RISC (Reduced Instruction Set Computer) 
architectures, in the early 1980s. The RISC-based machines focused the attention 
of designers on two critical performance techniques, the exploitation of instruction- 
level parallelism (initially through pipelining and later through multiple instruction 
issue) and the use of caches (initially in simple forms and later using more sophisti- 
cated organizations and optimizations). 

The RISC-based computers raised the performance bar, forcing prior architectures
to keep up or disappear. The Digital Equipment VAX could not, and so it
was replaced by a RISC architecture. Intel rose to the challenge, primarily by 
translating x86 (or IA-32) instructions into RISC-like instructions internally, 
allowing it to adopt many of the innovations first pioneered in the RISC designs. 
As transistor counts soared in the late 1990s, the hardware overhead of translat- 
ing the more complex x86 architecture became negligible. 

Figure 1.1 shows that the combination of architectural and organizational 
enhancements led to 16 years of sustained growth in performance at an annual 
rate of over 50%—a rate that is unprecedented in the computer industry. 

The effect of this dramatic growth rate in the 20th century has been twofold. 
First, it has significantly enhanced the capability available to computer users. For 
many applications, the highest-performance microprocessors of today outper- 
form the supercomputer of less than 10 years ago. 


Figure 1.1 Growth in processor performance since the mid-1980s. This chart plots performance relative to the
VAX 11/780 as measured by the SPECint benchmarks (see Section 1.8). Prior to the mid-1980s, processor perfor- 
mance growth was largely technology driven and averaged about 25% per year. The increase in growth to about 
52% since then is attributable to more advanced architectural and organizational ideas. By 2002, this growth led to a 
difference in performance of about a factor of seven. Performance for floating-point-oriented calculations has 
increased even faster. Since 2002, the limits of power, available instruction-level parallelism, and long memory 
latency have slowed uniprocessor performance growth to about 20% per year. Since SPEC has changed over the
years, performance of newer machines is estimated by a scaling factor that relates the performance for two different 
versions of SPEC (e.g., SPEC92, SPEC95, and SPEC2000). 


Second, this dramatic rate of improvement has led to the dominance of 
microprocessor-based computers across the entire range of computer design.
PCs and workstations have emerged as major products in the computer industry.
Minicomputers, which were traditionally made from off-the-shelf logic or from 
gate arrays, have been replaced by servers made using microprocessors. Main- 
frames have been almost replaced with multiprocessors consisting of small num- 
bers of off-the-shelf microprocessors. Even high-end supercomputers are being 
built with collections of microprocessors. 

These innovations led to a renaissance in computer design, which emphasized 
both architectural innovation and efficient use of technology improvements. This 
rate of growth has compounded so that by 2002, high-performance microproces- 
sors are about seven times faster than what would have been obtained by relying 
solely on technology, including improved circuit design. 
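
The factor of seven can be checked with a little compound-growth arithmetic. The sketch below assumes that the technology-only baseline is the roughly 35% annual growth quoted earlier for early microprocessors, and that the 52% rate from Figure 1.1 held for the 16 years ending in 2002. The function name is illustrative and does not come from the text.

```python
def relative_gain(actual_rate, baseline_rate, years):
    """Ratio of two performance levels after compounding annual growth rates."""
    return ((1 + actual_rate) / (1 + baseline_rate)) ** years


# Assumed inputs: 52% actual growth versus a 35% technology-only baseline,
# compounded over the 16 years ending in 2002.
gain = relative_gain(0.52, 0.35, 16)
print(f"Advantage from architectural ideas by 2002: about {gain:.1f}x")
```

Under these assumptions the ratio comes out a little under seven, which is consistent with the rounded figure in the text. A different baseline rate changes the answer sharply, so the result is only as good as the assumed 35%.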




However, Figure 1.1 also shows that this 16-year renaissance is over. Since 
2002, processor performance improvement has dropped to about 20% per year 
due to the triple hurdles of maximum power dissipation of air-cooled chips, little 
instruction-level parallelism left to exploit efficiently, and almost unchanged 
memory latency. Indeed, in 2004 Intel canceled its high-performance uniproces- 
sor projects and joined IBM and Sun in declaring that the road to higher perfor- 
mance would be via multiple processors per chip rather than via faster 
uniprocessors. This signals a historic switch from relying solely on instruction- 
level parallelism (ILP), the primary focus of the first three editions of this book, 
to thread-level parallelism (TLP) and data-level parallelism (DLP), which are 
featured in this edition. Whereas the compiler and hardware conspire to exploit 
ILP implicitly without the programmer's attention, TLP and DLP are explicitly 
parallel, requiring the programmer to write parallel code to gain performance. 

This text is about the architectural ideas and accompanying compiler 
improvements that made the incredible growth rate possible in the last century, 
the reasons for the dramatic change, and the challenges and initial promising 
approaches to architectural ideas and compilers for the 21st century. At the core 
is a quantitative approach to computer design and analysis that uses empirical 
observations of programs, experimentation, and simulation as its tools. It is this 
style and approach to computer design that is reflected in this text. This book was 
written not only to explain this design style, but also to stimulate you to contrib- 
ute to this progress. We believe the approach will work for explicitly parallel 
computers of the future just as it worked for the implicitly parallel computers of 
the past. 


Classes of Computers 


In the 1960s, the dominant form of computing was on large mainframes—com- 
puters costing millions of dollars and stored in computer rooms with multiple 
operators overseeing their support. Typical applications included business data 
processing and large-scale scientific computing. The 1970s saw the birth of the 
minicomputer, a smaller-sized computer initially focused on applications in sci- 
entific laboratories, but rapidly branching out with the popularity of time- 
sharing—multiple users sharing a computer interactively through independent 
terminals. That decade also saw the emergence of supercomputers, which were 
high-performance computers for scientific computing. Although few in number, 
they were important historically because they pioneered innovations that later 
trickled down to less expensive computer classes. The 1980s saw the rise of the 
desktop computer based on microprocessors, in the form of both personal com- 
puters and workstations. The individually owned desktop computer replaced 
time-sharing and led to the rise of servers—computers that provided larger-scale 
services such as reliable, long-term file storage and access, larger memory, and 
more computing power. The 1990s saw the emergence of the Internet and the 
World Wide Web, the first successful handheld computing devices (personal digi- 
Feature                        Desktop               Server                     Embedded
Price of system                $500-$5000            $5000-$5,000,000           $10-$100,000 (including network
                                                                                routers at the high end)
Price of microprocessor        $50-$500              $200-$10,000               $0.01-$100
module                         (per processor)       (per processor)            (per processor)
Critical system design issues  Price-performance,    Throughput, availability,  Price, power consumption,
                               graphics performance  scalability                application-specific performance





Figure 1.2 A summary of the three mainstream computing classes and their system characteristics. Note the 
wide range in system price for servers and embedded systems. For servers, this range arises from the need for very 
large-scale multiprocessor systems for high-end transaction processing and Web server applications. The total num- 
ber of embedded processors sold in 2005 is estimated to exceed 3 billion if you include 8-bit and 16-bit microproces- 
sors. Perhaps 200 million desktop computers and 10 million servers were sold in 2005. 


tal assistants or PDAs), and the emergence of high-performance digital consumer 
electronics, from video games to set-top boxes. The extraordinary popularity of 
cell phones has been obvious since 2000, with rapid improvements in functions 
and sales that far exceed those of the PC. These more recent applications use 
embedded computers, where computers are lodged in other devices and their 
presence is not immediately obvious. 

These changes have set the stage for a dramatic change in how we view com- 
puting, computing applications, and the computer markets in this new century. 
Not since the creation of the personal computer more than 20 years ago have we 
seen such dramatic changes in the way computers appear and in how they are 
used. These changes in computer use have led to three different computing mar- 
kets, each characterized by different applications, requirements, and computing 
technologies. Figure 1.2 summarizes these mainstream classes of computing
environments and their important characteristics. 


Desktop Computing 


The first, and still the largest market in dollar terms, is desktop computing. Desk- 
top computing spans from low-end systems that sell for under $500 to high-end, 
heavily configured workstations that may sell for $5000. Throughout this range 
in price and capability, the desktop market tends to be driven to optimize price- 
performance. This combination of performance (measured primarily in terms of 
compute performance and graphics performance) and price of a system is what 
matters most to customers in this market, and hence to computer designers. As a 
result, the newest, highest-performance microprocessors and cost-reduced micro- 
processors often appear first in desktop systems (see Section 1.6 for a discussion 
of the issues affecting the cost of computers). 

Desktop computing also tends to be reasonably well characterized in terms of 
applications and benchmarking, though the increasing use of Web-centric, inter- 
active applications poses new challenges in performance evaluation. 
Servers


As the shift to desktop computing occurred, the role of servers grew to provide 
larger-scale and more reliable file and computing services. The World Wide Web 
accelerated this trend because of the tremendous growth in the demand and 
sophistication of Web-based services. Such servers have become the backbone of 
large-scale enterprise computing, replacing the traditional mainframe. 

For servers, different characteristics are important. First, dependability is crit- 
ical. (We discuss dependability in Section 1.7.) Consider the servers running 
Google, taking orders for Cisco, or running auctions on eBay. Failure of such 
server systems is far more catastrophic than failure of a single desktop, since 
these servers must operate seven days a week, 24 hours a day. Figure 1.3 esti- 
mates revenue costs of downtime as of 2000. To bring costs up-to-date, Ama- 
zon.com had $2.98 billion in sales in the fall quarter of 2005. As there were about 
2200 hours in that quarter, the average revenue per hour was $1.35 million. Dur- 
ing a peak hour for Christmas shopping, the potential loss would be many times 
higher. 

Hence, the estimated costs of an unavailable system are high, yet Figure 1.3
and the Amazon numbers are purely lost revenue and do not account for lost 
employee productivity or the cost of unhappy customers. 
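
The downtime arithmetic above is easy to reproduce. The following sketch recomputes the Amazon revenue-per-hour figure and one row of Figure 1.3, assuming (as the figure caption does) that downtime is spread uniformly over a year of 8760 hours. The function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; 1% of this is the 87.6 hrs/yr in Figure 1.3


def revenue_per_hour(quarterly_sales, hours_in_quarter):
    """Average revenue per hour over a quarter."""
    return quarterly_sales / hours_in_quarter


def annual_downtime_loss(cost_per_hour, downtime_fraction):
    """Lost revenue per year if the system is down for the given fraction of the time."""
    return cost_per_hour * downtime_fraction * HOURS_PER_YEAR


# Amazon.com, fall quarter of 2005: $2.98 billion over about 2200 hours.
amazon = revenue_per_hour(2.98e9, 2200)
print(f"Amazon revenue per hour: ${amazon / 1e6:.2f} million")

# Brokerage operations: $6450 thousand per hour, unavailable 1% of the time.
brokerage = annual_downtime_loss(6450e3, 0.01)
print(f"Brokerage annual loss at 1% downtime: ${brokerage / 1e6:.0f} million")
```

The same function reproduces the other columns of Figure 1.3 by changing the downtime fraction to 0.005 or 0.001.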

A second key feature of server systems is scalability. Server systems often 
grow in response to an increasing demand for the services they support or an 
increase in functional requirements. Thus, the ability to scale up the computing 
capacity, the memory, the storage, and the I/O bandwidth of a server is crucial. 

Lastly, servers are designed for efficient throughput. That is, the overall per- 
formance of the server—in terms of transactions per minute or Web pages served 





                                                     Annual losses (millions of $) with downtime of
                              Cost of downtime per   1%              0.5%            0.1%
Application                   hour (thousands of $)  (87.6 hrs/yr)   (43.8 hrs/yr)   (8.8 hrs/yr)
Brokerage operations          $6450                  $565            $283            $56.5
Credit card authorization     $2600                  $228            $114            $22.8
Package shipping services     $150                   $13             $6.6            $1.3
Home shopping channel         $113                   $9.9            $4.9            $1.0
Catalog sales center          $90                    $7.9            $3.9            $0.8
Airline reservation center    $89                    $7.9            $3.9            $0.8
Cellular service activation   $41                    $3.6            $1.8            $0.4
Online network fees           $25                    $2.2            $1.1            $0.2
ATM service fees              $14                    $1.2            $0.6            $0.1





Figure 1.3 The cost of an unavailable system is shown by analyzing the cost of downtime (in terms of immedi- 
ately lost revenue), assuming three different levels of availability, and that downtime is distributed uniformly. 
These data are from Kembel [2000] and were collected and analyzed by Contingency Planning Research. 
per second—is what is crucial. Responsiveness to an individual request remains
important, but overall efficiency and cost-effectiveness, as determined by how 
many requests can be handled in a unit time, are the key metrics for most servers. 
We return to the issue of assessing performance for different types of computing 
environments in Section 1.8. 

A related category is supercomputers. They are the most expensive comput- 
ers, costing tens of millions of dollars, and they emphasize floating-point perfor- 
mance. Clusters of desktop computers, which are discussed in Appendix H, have 
largely overtaken this class of computer. As clusters grow in popularity, the number
of conventional supercomputers is shrinking, as is the number of companies
that make them.


Embedded Computers 


Embedded computers are the fastest growing portion of the computer market. 
These devices range from everyday machines—most microwaves, most washing 
machines, most printers, most networking switches, and all cars contain simple 
embedded microprocessors—to handheld digital devices, such as cell phones and 
smart cards, to video games and digital set-top boxes. 

Embedded computers have the widest spread of processing power and cost. 
They include 8-bit and 16-bit processors that may cost less than a dime, 32-bit 
microprocessors that execute 100 million instructions per second and cost under 
$5, and high-end processors for the newest video games or network switches that 
cost $100 and can execute a billion instructions per second. Although the range 
of computing power in the embedded computing market is very large, price is a 
key factor in the design of computers for this space. Performance requirements 
do exist, of course, but the primary goal is often meeting the performance need at 
a minimum price, rather than achieving higher performance at a higher price. 

Often, the performance requirement in an embedded application is real-time
execution. A real-time performance requirement exists when a segment of the
application has an absolute maximum execution time. For example, in a digital set-top
box, the time to process each video frame is limited, since the processor must 
accept and process the next frame shortly. In some applications, a more nuanced 
requirement exists: the average time for a particular task is constrained as well as 
the number of instances when some maximum time is exceeded. Such 
approaches—sometimes called soft real-time—arise when it is possible to occa- 
sionally miss the time constraint on an event, as long as not too many are missed. 
Real-time performance tends to be highly application dependent. 
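
The difference between the two kinds of requirement can be made concrete with a small check over measured task times. This is only a sketch: the function names, the 33.3 ms frame budget, the 32 ms average limit, and the sample timings are all made-up values for illustration, not figures from the text.

```python
def meets_hard_deadline(times_ms, max_ms):
    """Hard real-time: no instance may exceed the absolute maximum time."""
    return all(t <= max_ms for t in times_ms)


def meets_soft_deadline(times_ms, max_ms, avg_limit_ms, allowed_misses):
    """Soft real-time: the average time is bounded and only a few instances may run over."""
    misses = sum(1 for t in times_ms if t > max_ms)
    average = sum(times_ms) / len(times_ms)
    return average <= avg_limit_ms and misses <= allowed_misses


# Hypothetical per-frame processing times (ms) for a set-top box.
frame_times = [28.0, 31.5, 30.2, 35.1, 29.4]

print("hard:", meets_hard_deadline(frame_times, max_ms=33.3))
print("soft:", meets_soft_deadline(frame_times, max_ms=33.3,
                                   avg_limit_ms=32.0, allowed_misses=1))
```

With these sample values one frame overruns the budget, so the hard requirement fails while the soft requirement, which tolerates a single miss, is still met.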

Two other key characteristics exist in many embedded applications: the need 
to minimize memory and the need to minimize power. In many embedded appli- 
cations, the memory can be a substantial portion of the system cost, and it is 
important to optimize memory size in such cases. Sometimes the application is 
expected to fit totally in the memory on the processor chip; other times the 
application needs to fit totally in a small off-chip memory. In any event, the
importance of memory size translates to an emphasis on code size, since data size 
is dictated by the application. 

Larger memories also mean more power, and optimizing power is often criti- 
cal in embedded applications. Although the emphasis on low power is frequently 
driven by the use of batteries, the need to use less expensive packaging—plastic 
versus ceramic—and the absence of a fan for cooling also limit total power con- 
sumption. We examine the issue of power in more detail in Section 1.5. 

Most of this book applies to the design, use, and performance of embedded 
processors, whether they are off-the-shelf microprocessors or microprocessor 
cores, which will be assembled with other special-purpose hardware. 

Indeed, the third edition of this book included examples from embedded 
computing to illustrate the ideas in every chapter. Alas, most readers found these 
examples unsatisfactory, as the data that drives the quantitative design and evalu- 
ation of desktop and server computers has not yet been extended well to embed- 
ded computing (see the challenges with EEMBC, for example, in Section 1.8). 
Hence, we are left for now with qualitative descriptions, which do not fit well 
with the rest of the book. As a result, in this edition we consolidated the embed- 
ded material into a single appendix. We believe this new appendix (Appendix D) 
improves the flow of ideas in the text while still allowing readers to see how the 
differing requirements affect embedded computing. 


Defining Computer Architecture 


The task the computer designer faces is a complex one: Determine what 
attributes are important for a new computer, then design a computer to maximize 
performance while staying within cost, power, and availability constraints. This 
task has many aspects, including instruction set design, functional organization, 
logic design, and implementation. The implementation may encompass inte- 
grated circuit design, packaging, power, and cooling. Optimizing the design 
requires familiarity with a very wide range of technologies, from compilers and 
operating systems to logic design and packaging. 

In the past, the term computer architecture often referred only to instruction 
set design. Other aspects of computer design were called implementation, often 
insinuating that implementation is uninteresting or less challenging. 

We believe this view is incorrect. The architect's or designer's job is much 
more than instruction set design, and the technical hurdles in the other aspects of 
the project are likely more challenging than those encountered in instruction set 
design. We'll quickly review instruction set architecture before describing the 
larger challenges for the computer architect. 


Instruction Set Architecture 


We use the term instruction set architecture (ISA) to refer to the actual programmer- 
visible instruction set in this book. The ISA serves as the boundary between the 
software and hardware. This quick review of ISA will use examples from MIPS
and 80x86 to illustrate the seven dimensions of an ISA. Appendices B and J give 
more details on MIPS and the 80x86 ISAs. 


1. Class of ISA—Nearly all ISAs today are classified as general-purpose register 
architectures, where the operands are either registers or memory locations. 
The 80x86 has 16 general-purpose registers and 16 that can hold floating- 
point data, while MIPS has 32 general-purpose and 32 floating-point registers 
(see Figure 1.4). The two popular versions of this class are register-memory 
ISAs such as the 80x86, which can access memory as part of many instruc- 
tions, and load-store ISAs such as MIPS, which can access memory only 
with load or store instructions. All recent ISAs are load-store. 


2. Memory addressing—Virtually all desktop and server computers, including 
the 80x86 and MIPS, use byte addressing to access memory operands. Some 
architectures, like MIPS, require that objects must be aligned. An access to an 
object of size s bytes at byte address A is aligned if A mod s = 0. (See Figure 
B.5 on page B-9.) The 80x86 does not require alignment, but accesses are 
generally faster if operands are aligned. 


3. Addressing modes—In addition to specifying registers and constant operands, 
addressing modes specify the address of a memory object. MIPS addressing 
modes are Register, Immediate (for constants), and Displacement, where a 
constant offset is added to a register to form the memory address. The 80x86 
supports those three plus three variations of displacement: no register (abso- 
lute), two registers (based indexed with displacement), two registers where 












































Name      Number  Use                              Preserved across a call?
$zero     0       The constant value 0             N.A.
$at       1       Assembler temporary              No
$v0-$v1   2-3     Values for function results and  No
                  expression evaluation
$a0-$a3   4-7     Arguments                        No
$t0-$t7   8-15    Temporaries                      No
$s0-$s7   16-23   Saved temporaries                Yes
$t8-$t9   24-25   Temporaries                      No
$k0-$k1   26-27   Reserved for OS kernel           No
$gp       28      Global pointer                   Yes
$sp       29      Stack pointer                    Yes
$fp       30      Frame pointer                    Yes
$ra       31      Return address                   Yes





Figure 1.4 MIPS registers and usage conventions. In addition to the 32 general-purpose
registers (R0-R31), MIPS has 32 floating-point registers (F0-F31) that can hold
either a 32-bit single-precision number or a 64-bit double-precision number.
one register is multiplied by the size of the operand in bytes (based with
scaled index and displacement). It also has three more modes like the last three,
minus the displacement field: register indirect, indexed, and based with scaled index.


4. Types and sizes of operands—Like most ISAs, MIPS and 80x86 support
operand sizes of 8-bit (ASCII character), 16-bit (Unicode character or half 
word), 32-bit (integer or word), 64-bit (double word or long integer), and 
IEEE 754 floating point in 32-bit (single precision) and 64-bit (double pre- 
cision). The 80x86 also supports 80-bit floating point (extended double 
precision). 


5. Operations—The general categories of operations are data transfer, arithmetic
logical, control (discussed next), and floating point. MIPS is a simple
and easy-to-pipeline instruction set architecture, and it is representative of the
RISC architectures being used in 2006. Figure 1.5 summarizes the MIPS ISA.
The 80x86 has a much richer and larger set of operations (see Appendix J). 


6. Control flow instructions—Virtually all ISAs, including 80x86 and MIPS,
support conditional branches, unconditional jumps, procedure calls, and 
returns. Both use PC-relative addressing, where the branch address is specified
by an address field that is added to the PC. There are some small differences.
MIPS conditional branches (BEQ, BNE, etc.) test the contents of registers,
while the 80x86 branches (JE, JNE, etc.) test condition code bits set as side 
effects of arithmetic/logic operations. MIPS procedure call (JAL) places the 
return address in a register, while the 80x86 call (CALLF) places the return 
address on a stack in memory. 


7. Encoding an ISA—There are two basic choices on encoding: fixed length and
variable length. All MIPS instructions are 32 bits long, which simplifies
instruction decoding. Figure 1.6 shows the MIPS instruction formats. The
80x86 encoding is variable length, ranging from 1 to 18 bytes. Variable-length
instructions can take less space than fixed-length instructions, so a program
compiled for the 80x86 is usually smaller than the same program compiled
for MIPS. Note that choices mentioned above will affect how the
instructions are encoded into a binary representation. For example, the number
of registers and the number of addressing modes both have a significant
impact on the size of instructions, as the register field and addressing mode
field can appear many times in a single instruction.
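
Two of the dimensions above reduce to simple arithmetic: the alignment rule (A mod s = 0) and the address calculations behind the displacement and scaled-index addressing modes. The sketch below illustrates both. The function names and sample addresses are hypothetical, and real hardware performs these steps in the address-generation logic, not in software.

```python
def is_aligned(address, size):
    """An access to an object of `size` bytes at byte `address` is aligned if address mod size = 0."""
    return address % size == 0


def displacement_address(base, offset):
    """Displacement mode: a constant offset is added to a register's contents."""
    return base + offset


def scaled_index_address(base, index, operand_size, displacement=0):
    """Based with scaled index (and optional displacement): base + index * operand size + displacement."""
    return base + index * operand_size + displacement


# Hypothetical register contents and constants.
print(is_aligned(0x1000, 8))                              # double word on an 8-byte boundary
print(is_aligned(0x1004, 8))                              # same object, 4 bytes further on
print(hex(displacement_address(0x1000, 16)))              # base register plus constant offset
print(hex(scaled_index_address(0x2000, 3, 8, 4)))         # element 3 of 8-byte operands, plus 4
```

Dropping the `displacement` argument gives the based-with-scaled-index mode that omits the displacement field.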


The other challenges facing the computer architect beyond ISA design are
particularly acute at the present, when the differences among instruction sets are
small and when there are distinct application areas. Therefore, starting with this
edition, the bulk of instruction set material beyond this quick review is found in
the appendices (see Appendices B and J).


We use a subset of MIPS64 as the example ISA in this book. 
Instruction type/opcode       Instruction meaning

Data transfers                Move data between registers and memory, or between the integer and FP or special
                              registers; only memory address mode is 16-bit displacement + contents of a GPR
LB, LBU, SB                   Load byte, load byte unsigned, store byte (to/from integer registers)
LH, LHU, SH                   Load half word, load half word unsigned, store half word (to/from integer registers)
LW, LWU, SW                   Load word, load word unsigned, store word (to/from integer registers)
LD, SD                        Load double word, store double word (to/from integer registers)
L.S, L.D, S.S, S.D            Load SP float, load DP float, store SP float, store DP float
MFC0, MTC0                    Copy from/to GPR to/from a special register
MOV.S, MOV.D                  Copy one SP or DP FP register to another FP register
MFC1, MTC1                    Copy 32 bits to/from FP registers from/to integer registers

Arithmetic/logical            Operations on integer or logical data in GPRs; signed arithmetic trap on overflow
DADD, DADDI, DADDU, DADDIU    Add, add immediate (all immediates are 16 bits); signed and unsigned
DSUB, DSUBU                   Subtract; signed and unsigned
DMUL, DMULU, DDIV,            Multiply and divide, signed and unsigned; multiply-add; all operations take and yield
DDIVU, MADD                   64-bit values
AND, ANDI                     And, and immediate
OR, ORI, XOR, XORI            Or, or immediate, exclusive or, exclusive or immediate
LUI                           Load upper immediate; loads bits 32 to 47 of register with immediate, then sign-extends
DSLL, DSRL, DSRA, DSLLV,      Shifts: both immediate (DS__) and variable form (DS__V); shifts are shift left logical,
DSRLV, DSRAV                  right logical, right arithmetic
SLT, SLTI, SLTU, SLTIU        Set less than, set less than immediate; signed and unsigned

Control                       Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ                    Branch GPRs equal/not equal to zero; 16-bit offset from PC + 4
BEQ, BNE                      Branch GPR equal/not equal; 16-bit offset from PC + 4
BC1T, BC1F                    Test comparison bit in the FP status register and branch; 16-bit offset from PC + 4
MOVN, MOVZ                    Copy GPR to another GPR if third GPR is negative, zero
J, JR                         Jumps: 26-bit offset from PC + 4 (J) or target in register (JR)
JAL, JALR                     Jump and link: save PC + 4 in R31, target is PC-relative (JAL) or a register (JALR)
TRAP                          Transfer to operating system at a vectored address
ERET                          Return to user code from an exception; restore user mode

Floating point                FP operations on DP and SP formats
ADD.D, ADD.S, ADD.PS          Add DP, SP numbers, and pairs of SP numbers
SUB.D, SUB.S, SUB.PS          Subtract DP, SP numbers, and pairs of SP numbers
MUL.D, MUL.S, MUL.PS          Multiply DP, SP floating point, and pairs of SP numbers
MADD.D, MADD.S, MADD.PS       Multiply-add DP, SP numbers, and pairs of SP numbers
DIV.D, DIV.S, DIV.PS          Divide DP, SP floating point, and pairs of SP numbers
CVT._._                       Convert instructions: CVT.x.y converts from type x to type y, where x and y are L
                              (64-bit integer), W (32-bit integer), D (DP), or S (SP). Both operands are FPRs.
C.__.D, C.__.S                DP and SP compares: "__" = LT, GT, LE, GE, EQ, NE; sets bit in FP status register

Figure 1.5 Subset of the instructions in MIPS64. SP = single precision; DP = double precision. Appendix B gives
much more detail on MIPS64. For data, the most significant bit number is 0; least is 63.
Basic instruction formats

R    opcode    rs       rt       rd       shamt    funct
     31-26     25-21    20-16    15-11    10-6     5-0

I    opcode    rs       rt       immediate
     31-26     25-21    20-16    15-0

J    opcode    address
     31-26     25-0

Floating-point instruction formats

FR   opcode    fmt      ft       fs       fd       funct
     31-26     25-21    20-16    15-11    10-6     5-0

FI   opcode    fmt      ft       immediate
     31-26     25-21    20-16    15-0

Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits
long. The R format is for integer register-to-register operations, such as DADDU, DSUBU,
and so on. The I format is for data transfers, branches, and immediate instructions, such
as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating point
operations, and the FI format for floating point branches.
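
Because every field in Figure 1.6 sits at a fixed bit position, packing and unpacking an instruction word takes only shifts and masks. The sketch below does this for the R format, using the field boundaries from the figure. The helper names and the field values in the example are arbitrary illustrations and are not claimed to be the encoding of any particular MIPS64 instruction.

```python
# (field name, position of its lowest bit, width in bits) for the R format.
R_FIELDS = [
    ("opcode", 26, 6),
    ("rs", 21, 5),
    ("rt", 16, 5),
    ("rd", 11, 5),
    ("shamt", 6, 5),
    ("funct", 0, 6),
]


def encode_r(**fields):
    """Pack R-format field values into one 32-bit word; omitted fields default to 0."""
    word = 0
    for name, shift, width in R_FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode_r(word):
    """Split a 32-bit word back into its R-format fields."""
    return {name: (word >> shift) & ((1 << width) - 1)
            for name, shift, width in R_FIELDS}


# Arbitrary example values chosen only to exercise every field position.
word = encode_r(opcode=0, rs=2, rt=3, rd=1, shamt=0, funct=0x2D)
print(f"{word:#010x}")
print(decode_r(word))
```

The fixed positions are what make decoding simple: the register fields can be extracted before the opcode has been fully interpreted, which is harder with a variable-length encoding.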


The Rest of Computer Architecture: Designing the 
Organization and Hardware to Meet Goals and 
Functional Requirements 


The implementation of a computer has two components: organization and 
hardware. The term organization includes the high-level aspects of a computer's 
design, such as the memory system, the memory interconnect, and the design of 
the internal processor or CPU (central processing unit—where arithmetic, logic, 
branching, and data transfer are implemented). For example, two processors with 
the same instruction set architectures but very different organizations are the 
AMD Opteron 64 and the Intel Pentium 4. Both processors implement the x86 
instruction set, but they have very different pipeline and cache organizations. 

Hardware refers to the specifics of a computer, including the detailed logic 
design and the packaging technology of the computer. Often a line of computers 
contains computers with identical instruction set architectures and nearly identi- 
cal organizations, but they differ in the detailed hardware implementation. For 
example, the Pentium 4 and the Mobile Pentium 4 are nearly identical, but offer 
different clock rates and different memory systems, making the Mobile Pentium 
4 more effective for low-end computers. 

In this book, the word architecture covers all three aspects of computer 
design—instruction set architecture, organization, and hardware. 

Computer architects must design a computer to meet functional requirements 
as well as price, power, performance, and availability goals. Figure 1.7 summa- 
rizes requirements to consider in designing a new computer. Often, architects 

Functional requirements | Typical features required or supported

Application area | Target of computer
  General-purpose desktop | Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Ch. 2, 3, 5, App. B)
  Scientific desktops and servers | High-performance floating point and graphics (App. I)
  Commercial servers | Support for databases and transaction processing; enhancements for reliability and availability; support for scalability (Ch. 4, App. B, E)
  Embedded computing | Often requires special support for graphics or video (or other application-specific extension); power limitations and power control may be required (Ch. 2, 3, 5, App. B)

Level of software compatibility | Determines amount of existing software for computer
  At programming language | Most flexible for designer; need new compiler (Ch. 4, App. B)
  Object code or binary compatible | Instruction set architecture is completely defined (little flexibility), but no investment needed in software or porting programs

Operating system requirements | Necessary features to support chosen OS (Ch. 5, App. E)
  Size of address space | Very important feature (Ch. 5); may limit applications
  Memory management | Required for modern OS; may be paged or segmented (Ch. 5)
  Protection | Different OS and application needs: page vs. segment; virtual machines (Ch. 5)

Standards | Certain standards may be required by marketplace
  Floating point | Format and arithmetic: IEEE 754 standard (App. I), special arithmetic for graphics or signal processing
  I/O interfaces | For I/O devices: Serial ATA, Serial Attach SCSI, PCI Express (Ch. 6, App. E)
  Operating systems | UNIX, Windows, Linux, CISCO IOS
  Networks | Support required for different networks: Ethernet, Infiniband (App. E)
  Programming languages | Languages (ANSI C, C++, Java, FORTRAN) affect instruction set (App. B)

Figure 1.7 Summary of some of the most important functional requirements an architect faces. The left-hand
column describes the class of requirement, while the right-hand column gives specific examples. The right-hand
column also contains references to chapters and appendices that deal with the specific issues.


also must determine what the functional requirements are, which can be a major 
task. The requirements may be specific features inspired by the market. Applica- 
tion software often drives the choice of certain functional requirements by deter- 
mining how the computer will be used. If a large body of software exists for a 
certain instruction set architecture, the architect may decide that a new computer 
should implement an existing instruction set. The presence of a large market for a 
particular class of applications might encourage the designers to incorporate 
requirements that would make the computer competitive in that market. Many of 
these requirements and features are examined in depth in later chapters. 


Architects must also be aware of important trends in both the technology and
the use of computers, as such trends affect not only future cost but also the
longevity of an architecture.


Trends in Technology


If an instruction set architecture is to be successful, it must be designed to survive 
rapid changes in computer technology. After all, a successful new instruction set 
architecture may last decades—for example, the core of the IBM mainframe has 
been in use for more than 40 years. An architect must plan for technology 
changes that can increase the lifetime of a successful computer. 

To plan for the evolution of a computer, the designer must be aware of rapid 
changes in implementation technology. Four implementation technologies, which 
change at a dramatic pace, are critical to modern implementations: 


• Integrated circuit logic technology: Transistor density increases by about
35% per year, quadrupling in somewhat over four years. Increases in die size
are less predictable and slower, ranging from 10% to 20% per year. The combined
effect is a growth rate in transistor count on a chip of about 40% to 55%
per year. Device speed scales more slowly, as we discuss below.


• Semiconductor DRAM (dynamic random-access memory): Capacity
increases by about 40% per year, doubling roughly every two years.


• Magnetic disk technology: Prior to 1990, density increased by about 30%
per year, doubling in three years. It rose to 60% per year thereafter, and
increased to 100% per year in 1996. Since 2004, it has dropped back to
30% per year. Despite this roller coaster of rates of improvement, disks are
still 50-100 times cheaper per bit than DRAM. This technology is central to
Chapter 6, and we discuss the trends in detail there.


• Network technology: Network performance depends both on the performance
of switches and on the performance of the transmission system. We
discuss the trends in networking in Appendix E.
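
The annual growth rates quoted in the list above translate into doubling or quadrupling times through compound growth. The sketch below shows that arithmetic; the function name is an assumption introduced for illustration, and the rates are the ones stated in the text.

```python
import math


def years_to_multiply(annual_growth, factor=2.0):
    """Years needed to grow by `factor` at a compound rate of `annual_growth` per year."""
    return math.log(factor) / math.log(1.0 + annual_growth)


# Rates quoted in the text: 35%/year transistor density, 40%/year DRAM capacity,
# and 30%, 60%, 100%/year for disk density in different periods.
print(f"35%/yr, quadrupling: {years_to_multiply(0.35, 4.0):.1f} years")
print(f"40%/yr, doubling:    {years_to_multiply(0.40):.1f} years")
for rate in (0.30, 0.60, 1.00):
    print(f"{rate:.0%}/yr, doubling:   {years_to_multiply(rate):.1f} years")
```

At 35% per year, quadrupling takes a little over four and a half years, which matches "somewhat over four years"; at 40% per year, doubling takes just over two years.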


These rapidly changing technologies shape the design of a computer that, 
with speed and technology enhancements, may have a lifetime of five or more 
years. Even within the span of a single product cycle for a computing system 
(two years of design and two to three years of production), key technologies such 
as DRAM change sufficiently that the designer must plan for these changes. 
Indeed, designers often design for the next technology, knowing that when a 
product begins shipping in volume that next technology may be the most cost- 
effective or may have performance advantages. Traditionally, cost has decreased 
at about the rate at which density increases. 

Although technology improves continuously, the impact of these improve- 
ments can be in discrete leaps, as a threshold that allows a new capability is 
reached. For example, when MOS technology reached a point in the early 1980s 
where between 25,000 and 50,000 transistors could fit on a single chip, it became 
possible to build a single-chip, 32-bit microprocessor. By the late 1980s, first- 
level caches could go on chip. By eliminating chip crossings within the processor 
and between the processor and the cache, a dramatic improvement in cost-performance
and power-performance was possible. This design was simply infeasible
until the technology reached a certain point. Such technology thresholds are
not rare and have a significant impact on a wide variety of design decisions.


Performance Trends: Bandwidth over Latency 


As we shall see in Section 1.8, bandwidth or throughput is the total amount of 
work done in a given time, such as megabytes per second for a disk transfer. In 
contrast, latency or response time is the time between the start and the comple- 
tion of an event, such as milliseconds for a disk access. Figure 1.8 plots the rela- 
tive improvement in bandwidth and latency for technology milestones for 
microprocessors, memory, networks, and disks. Figure 1.9 describes the exam- 
ples and milestones in more detail. Clearly, bandwidth improves much more rap- 
idly than latency. 

Performance is the primary differentiator for microprocessors and networks, 
so they have seen the greatest gains: 1000-2000X in bandwidth and 20-40X in 
latency. Capacity is generally more important than performance for memory and 
disks, so capacity has improved most, yet their bandwidth advances of 120-140X 
are still much greater than their gains in latency of 4-8X. Clearly, bandwidth has 
outpaced latency across these technologies and will likely continue to do so. 

A simple rule of thumb is that bandwidth grows by at least the square of the
improvement in latency. Computer designers should make plans accordingly.

Figure 1.8 Log-log plot of bandwidth and latency milestones from Figure 1.9 relative
to the first milestone. Note that latency improved about 10X while bandwidth
improved about 100X to 1000X. From Patterson [2004].

Microprocessor | 16-bit address/bus, microcoded | 32-bit address/bus, microcoded | 5-stage pipeline, on-chip I & D caches, FPU | 2-way superscalar, 64-bit bus | Out-of-order 3-way superscalar | Out-of-order superpipelined, on-chip L2 cache
Product | Intel 80286 | Intel 80386 | Intel 80486 | Intel Pentium | Intel Pentium Pro | Intel Pentium 4
Year | 1982 | 1985 | 1989 | 1993 | 1997 | 2001
Die size (mm²) | 47 | 43 | 81 | 90 | 308 | 217
Transistors | 134,000 | 275,000 | 1,200,000 | 3,100,000 | 5,500,000 | 42,000,000
Pins | 68 | 132 | 168 | 273 | 387 | 423
Latency (clocks) | 6 | 5 | 5 | 5 | 10 | 22
Bus width (bits) | 16 | 32 | 32 | 64 | 64 | 64
Clock rate (MHz) | 12.5 | 16 | 25 | 66 | 200 | 1500
Bandwidth (MIPS) | 2 | 6 | 25 | 132 | 600 | 4500
Latency (ns) | 320 | 313 | 200 | 76 | 50 | 15

Memory module | DRAM | Page mode DRAM | Fast page mode DRAM | Fast page mode DRAM | Synchronous DRAM | Double data rate SDRAM
Module width (bits) | 16 | 16 | 32 | 64 | 64 | 64
Year | 1980 | 1983 | 1986 | 1993 | 1997 | 2000
Mbits/DRAM chip | 0.06 | 0.25 | 1 | 16 | 64 | 256
Die size (mm²) | 35 | 45 | 70 | 130 | 170 | 204
Pins/DRAM chip | 16 | 16 | 18 | 20 | 54 | 66
Bandwidth (MBit/sec) | 13 | 40 | 160 | 267 | 640 | 1600
Latency (ns) | 225 | 170 | 125 | 75 | 62 | 52

Local area network | Ethernet | Fast Ethernet | Gigabit Ethernet | 10 Gigabit Ethernet
IEEE standard | 802.3 | 802.3u | 802.3ab | 802.3ae
Year | 1978 | 1995 | 1999 | 2003
Bandwidth (MBit/sec) | 10 | 100 | 1000 | 10000
Latency (μsec) | 3000 | 500 | 340 | 190

Hard disk | 3600 RPM | 5400 RPM | 7200 RPM | 10,000 RPM | 15,000 RPM
Product | CDC WrenI 94145-36 | Seagate ST41600 | Seagate ST15150 | Seagate ST39102 | Seagate ST373453
Year | 1983 | 1990 | 1994 | 1998 | 2003
Capacity (GB) | 0.03 | 1.4 | 4.3 | 9.1 | 73.4
Disk form factor | 5.25 inch | 5.25 inch | 3.5 inch | 3.5 inch | 3.5 inch
Media diameter | 5.25 inch | 5.25 inch | 3.5 inch | 3.0 inch | 2.5 inch
Interface | ST-412 | SCSI | SCSI | SCSI | SCSI
Bandwidth (MBit/sec) | 0.6 | 4 | 9 | 24 | 86
Latency (ms) | 48.3 | 17.1 | 12.7 | 8.8 | 5.7





Figure 1.9 Performance milestones over 20 to 25 years for microprocessors, memory, networks, and disks. The
microprocessor milestones are six generations of IA-32 processors, going from a 16-bit bus, microcoded 80286 to a 
64-bit bus, superscalar, out-of-order execution, superpipelined Pentium 4. Memory module milestones go from 16- 
bit-wide, plain DRAM to 64-bit-wide double data rate synchronous DRAM. Ethernet advanced from 10 Mb/sec to 10 
Gb/sec. Disk milestones are based on rotation speed, improving from 3600 RPM to 15,000 RPM. Each case is best- 
case bandwidth, and latency is the time for a simple operation assuming no contention. From Patterson [2004]. 
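
The rule of thumb that bandwidth grows by at least the square of the improvement in latency can be checked against the first and last milestones of Figure 1.9. The sketch below is one way to do that; the dictionary layout and function name are assumptions made for illustration, while the numbers are the bandwidth and latency rows of the figure.

```python
# (first bandwidth, last bandwidth, first latency, last latency) from Figure 1.9.
# Units differ per technology, but only the ratios matter here.
milestones = {
    "microprocessor": (2, 4500, 320, 15),          # MIPS, ns
    "memory module": (13, 1600, 225, 52),          # MBit/sec, ns
    "local area network": (10, 10000, 3000, 190),  # MBit/sec, microseconds
    "hard disk": (0.6, 86, 48.3, 5.7),             # MBit/sec, ms
}


def improvement(first_bw, last_bw, first_lat, last_lat):
    """Return (bandwidth gain, latency gain) between the first and last milestone."""
    return last_bw / first_bw, first_lat / last_lat


results = {name: improvement(*row) for name, row in milestones.items()}
for name, (bw_gain, lat_gain) in results.items():
    print(f"{name}: bandwidth {bw_gain:.0f}x, latency {lat_gain:.1f}x, "
          f"latency squared {lat_gain ** 2:.0f}x")
```

For all four technologies the bandwidth gain exceeds the square of the latency gain; for microprocessors, for example, bandwidth improved 2250X while latency improved by a factor of about 21.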

Scaling of Transistor Performance and Wires


Integrated circuit processes are characterized by the feature size, which is the 
minimum size of a transistor or a wire in either the x or y dimension. Feature 
sizes have decreased from 10 microns in 1971 to 0.09 microns in 2006; in fact, 
we have switched units, so production in 2006 is now referred to as "90 nanome- 
ters," and 65 nanometer chips are underway. Since the transistor count per square 
millimeter of silicon is determined by the surface area of a transistor, the density 
of transistors increases quadratically with a linear decrease in feature size. 

The increase in transistor performance, however, is more complex. As feature 
sizes shrink, devices shrink quadratically in the horizontal dimension and also 
shrink in the vertical dimension. The shrink in the vertical dimension requires a 
reduction in operating voltage to maintain correct operation and reliability of the 
transistors. This combination of scaling factors leads to a complex interrelation- 
ship between transistor performance and process feature size. To a first approxi- 
mation, transistor performance improves linearly with decreasing feature size. 
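
The first-order scaling described above (density grows with the square of the shrink, performance grows linearly) can be written out for a concrete pair of feature sizes. This is an idealized sketch: the function name is hypothetical, and real processes deviate from this model.

```python
def first_order_scaling(old_nm, new_nm):
    """Return (density gain, performance gain) under idealized feature-size scaling."""
    shrink = old_nm / new_nm
    density_gain = shrink ** 2   # transistor area scales with the square of feature size
    performance_gain = shrink    # first approximation: linear in feature size
    return density_gain, performance_gain


density_gain, performance_gain = first_order_scaling(90, 65)
print(f"90 nm -> 65 nm: density {density_gain:.2f}x, performance {performance_gain:.2f}x")
```

Under these assumptions, moving from 90 nm to 65 nm gives a little under twice the transistor density but well under one and a half times the transistor performance.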

The fact that transistor count improves quadratically with a linear improve- 
ment in transistor performance is both the challenge and the opportunity for 
which computer architects were created! In the early days of microprocessors, 
the higher rate of improvement in density was used to move quickly from 4-bit, 
to 8-bit, to 16-bit, to 32-bit microprocessors. More recently, density improve- 
ments have supported the introduction of 64-bit microprocessors as well as many 
of the innovations in pipelining and caches found in Chapters 2, 3, and 5. 

Although transistors generally improve in performance with decreased fea- 
ture size, wires in an integrated circuit do not. In particular, the signal delay for a 
wire increases in proportion to the product of its resistance and capacitance. Of 
course, as feature size shrinks, wires get shorter, but the resistance and capaci- 
tance per unit length get worse. This relationship is complex, since both resis- 
tance and capacitance depend on detailed aspects of the process, the geometry of 
a wire, the loading on a wire, and even the adjacency to other structures. There 
are occasional process enhancements, such as the introduction of copper, which 
provide one-time improvements in wire delay. 

In general, however, wire delay scales poorly compared to transistor perfor- 
mance, creating additional challenges for the designer. In the past few years, wire 
delay has become a major design limitation for large integrated circuits and is 
often more critical than transistor switching delay. Larger and larger fractions of 
the clock cycle have been consumed by the propagation delay of signals on wires. 
In 2001, the Pentium 4 broke new ground by allocating 2 stages of its 20+-stage 
pipeline just for propagating signals across the chip. 


Trends in Power in Integrated Circuits 


Power also provides challenges as devices are scaled. First, power must be
brought in and distributed around the chip, and modern microprocessors use
hundreds of pins and multiple interconnect layers for just power and ground.
Second, power is dissipated as heat and must be removed.

For CMOS chips, the traditional dominant energy consumption has been in 
switching transistors, also called dynamic power. The power required per transis- 
tor is proportional to the product of the load capacitance of the transistor, the 
square of the voltage, and the frequency of switching, with watts being the unit: 


$$\text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$


Mobile devices care about battery life more than power, so energy is the proper 
metric, measured in joules: 


$$\text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2$$


Hence, dynamic power and energy are greatly reduced by lowering the voltage,
and so voltages have dropped from 5V to just over 1V in 20 years. The
capacitive load is a function of the number of transistors connected to an output 
and the technology, which determines the capacitance of the wires and the tran- 
sistors. For a fixed task, slowing clock rate reduces power, but not energy. 


Example

Some microprocessors today are designed to have adjustable voltage, so that a
15% reduction in voltage may result in a 15% reduction in frequency. What 
would be the impact on dynamic power? 


Answer

Since the capacitance is unchanged, the answer is the ratios of the voltages and
frequencies: 


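$$\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}} = \frac{(\text{Voltage} \times 0.85)^2 \times (\text{Frequency switched} \times 0.85)}{\text{Voltage}^2 \times \text{Frequency switched}} = 0.85^3 = 0.61$$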


thereby reducing power to about 60% of the original. 
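
The same calculation can be sketched in code using the dynamic power and energy relations given earlier. The function names and the baseline capacitance, voltage, and frequency are arbitrary assumptions; only the ratios are meaningful.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power in watts: 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


def dynamic_energy(capacitive_load, voltage):
    """Dynamic energy in joules: capacitive load x voltage^2."""
    return capacitive_load * voltage ** 2


# Arbitrary illustrative baseline: 1 nF, 1.2 V, 3 GHz.
c, v, f = 1e-9, 1.2, 3.0e9

# 15% lower voltage together with 15% lower frequency.
power_ratio = dynamic_power(c, 0.85 * v, 0.85 * f) / dynamic_power(c, v, f)
energy_ratio = dynamic_energy(c, 0.85 * v) / dynamic_energy(c, v)

# Halving the clock rate alone halves power; the energy relation has no frequency term.
slow_clock_ratio = dynamic_power(c, v, f / 2) / dynamic_power(c, v, f)

print(f"power ratio:  {power_ratio:.2f}")
print(f"energy ratio: {energy_ratio:.2f}")
print(f"power ratio at half clock rate: {slow_clock_ratio:.2f}")
```

The power ratio is 0.85 cubed, or about 0.61, while the energy ratio is 0.85 squared, or about 0.72, since energy depends on voltage but not on frequency.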


As we move from one process to the next, the increase in the number of 
transistors switching, and the frequency with which they switch, dominates the 
decrease in load capacitance and voltage, leading to an overall growth in power 
consumption and energy. The first microprocessors consumed tenths of a watt, 
while a 3.2 GHz Pentium 4 Extreme Edition consumes 135 watts. Given that 
this heat must be dissipated from a chip that is about 1 cm on a side, we are 
reaching the limits of what can be cooled by air. Several Intel microprocessors 
have temperature diodes to reduce activity automatically if the chip gets too 
hot. For example, they may reduce voltage and clock frequency or the instruc- 
tion issue rate. 

Distributing the power, removing the heat, and preventing hot spots have 
become increasingly difficult challenges. Power is now the major limitation to 
using transistors; in the past it was raw silicon area. As a result of this limitation, 
most microprocessors today turn off the clock of inactive modules to save energy 
and dynamic power. For example, if no floating-point instructions are executing,
the clock of the floating-point unit is disabled. 

Although dynamic power is the primary source of power dissipation in 
CMOS, static power is becoming an important issue because leakage current 
flows even when a transistor is off: 


$$\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$$


Thus, increasing the number of transistors increases power even if they are turned 
off, and leakage current increases in processors with smaller transistor sizes. As a 
result, very low power systems are even gating the voltage to inactive modules to 
control loss due to leakage. In 2006, the goal for leakage is 25% of the total 
power consumption, with leakage in high-performance designs sometimes far 
exceeding that goal. As mentioned before, the limits of air cooling have led to 
exploration of multiple processors on a chip running at lower voltages and clock 
rates. 


Trends in Cost 


Although there are computer designs where costs tend to be less important— 
specifically supercomputers—cost-sensitive designs are of growing significance. 
Indeed, in the past 20 years, the use of technology improvements to lower cost, as 
well as increase performance, has been a major theme in the computer industry. 

Textbooks often ignore the cost half of cost-performance because costs 
change, thereby dating books, and because the issues are subtle and differ across 
industry segments. Yet an understanding of cost and its factors is essential for 
designers to make intelligent decisions about whether or not a new feature should 
be included in designs where cost is an issue. (Imagine architects designing sky- 
scrapers without any information on costs of steel beams and concrete!) 

This section discusses the major factors that influence the cost of a computer 
and how these factors are changing over time. 


The Impact of Time, Volume, and Commodification


The cost of a manufactured computer component decreases over time even with- 
out major improvements in the basic implementation technology. The underlying 
principle that drives costs down is the learning curve—manufacturing costs 
decrease over time. The learning curve itself is best measured by change in 
yield—the percentage of manufactured devices that survives the testing proce- 
dure. Whether it is a chip, a board, or a system, designs that have twice the yield 
will have half the cost. 

Understanding how the learning curve improves yield is critical to projecting 
costs over a product's life. One example is that the price per megabyte of DRAM 
has dropped over the long term by 40% per year. Since DRAMs tend to be priced
in close relationship to cost (with the exception of periods when there is a
shortage or an oversupply), price and cost of DRAM track closely.

Microprocessor prices also drop over time, but because they are less stan- 
dardized than DRAMs, the relationship between price and cost is more complex. 
In a period of significant competition, price tends to track cost closely, although 
microprocessor vendors probably rarely sell at a loss. Figure 1.10 shows proces- 
sor price trends for Intel microprocessors. 

Volume is a second key factor in determining cost. Increasing volumes affect 
cost in several ways. First, they decrease the time needed to get down the learning 
curve, which is partly proportional to the number of systems (or chips) manufac- 
tured. Second, volume decreases cost, since it increases purchasing and manu- 
facturing efficiency. As a rule of thumb, some designers have estimated that cost 
decreases about 10% for each doubling of volume. Moreover, volume decreases 
the amount of development cost that must be amortized by each computer, thus 
allowing cost and selling price to be closer. 
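
The rule of thumb that cost falls about 10% for each doubling of volume can be expressed as a small function. This is a sketch of that rule only; the function name, the baseline cost, and the volumes are hypothetical.

```python
import math


def cost_at_volume(base_cost, base_volume, volume, drop_per_doubling=0.10):
    """Estimated unit cost after scaling volume, assuming a fixed drop per doubling."""
    doublings = math.log2(volume / base_volume)
    return base_cost * (1.0 - drop_per_doubling) ** doublings


# Hypothetical part costing $100 at 10,000 units, scaled to 80,000 units (three doublings).
scaled_cost = cost_at_volume(100.0, 10_000, 80_000)
print(f"estimated unit cost: ${scaled_cost:.2f}")
```

Three doublings at 10% each leave about 73% of the original cost, since the reductions compound rather than add.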
Figure 1.10 The price of an Intel Pentium 4 and Pentium M at a given frequency
decreases over time as yield enhancements decrease the cost of a good die and
competition forces price reductions. The most recent introductions will continue to
decrease until they reach similar prices to the lowest-cost parts available today ($200).
Such price decreases assume a competitive environment where price decreases track
cost decreases closely. Data courtesy of Microprocessor Report, May 2005.

Commodities are products that are sold by multiple vendors in large volumes
and are essentially identical. Virtually all the products sold on the shelves of gro- 
cery stores are commodities, as are standard DRAMs, disks, monitors, and key- 
boards. In the past 15 years, much of the low end of the computer business has 
become a commodity business focused on building desktop and laptop computers 
running Microsoft Windows. 

Because many vendors ship virtually identical products, the market is highly
competitive. Of course, this competition decreases the gap between cost and selling price,
but it also decreases cost. Reductions occur because a commodity market has 
both volume and a clear product definition, which allows multiple suppliers to 
compete in building components for the commodity product. As a result, the 
overall product cost is lower because of the competition among the suppliers of 
the components and the volume efficiencies the suppliers can achieve. This has 
led to the low end of the computer business being able to achieve better price- 
performance than other sectors and yielded greater growth at the low end, 
although with very limited profits (as is typical in any commodity business). 


Cost of an Integrated Circuit 


Why would a computer architecture book have a section on integrated circuit 
costs? In an increasingly competitive computer marketplace where standard 
parts—disks, DRAMs, and so on—are becoming a significant portion of any sys- 
tem's cost, integrated circuit costs are becoming a greater portion of the cost that 
varies between computers, especially in the high-volume, cost-sensitive portion 
of the market. Thus, computer designers must understand the costs of chips to 
understand the costs of current computers. 

Although the costs of integrated circuits have dropped exponentially, the 
basic process of silicon manufacture is unchanged: A wafer is still tested and 
chopped into dies that are packaged (see Figures 1.11 and 1.12). Thus the cost of 
a packaged integrated circuit is 


$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$


In this section, we focus on the cost of dies, summarizing the key issues in testing 
and packaging at the end. 

Learning how to predict the number of good chips per wafer requires first 
learning how many dies fit on a wafer and then learning how to predict the per- 
centage of those that will work. From there it is simple to predict cost: 


$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$





The most interesting feature of this first term of the chip cost equation is its sensi- 
tivity to die size, shown below. 

Figure 1.11 Photograph of an AMD Opteron microprocessor die. (Courtesy AMD.) 


The number of dies per wafer is approximately the area of the wafer divided 
by the area of the die. It can be more accurately estimated by 


$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$

The first term is the ratio of wafer area (πr²) to die area. The second compensates
for the "square peg in a round hole" problem: rectangular dies near the periphery
of round wafers. Dividing the circumference (πd) by the diagonal of a square
die is approximately the number of dies along the edge.


Example

Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a
side.

Answer

The die area is 2.25 cm². Thus

$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{2.25} - \frac{\pi \times 30}{\sqrt{2 \times 2.25}} = \frac{706.9}{2.25} - \frac{94.2}{2.12} = 270$$
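
The dies-per-wafer estimate for a 30 cm wafer and a 2.25 cm² die can be reproduced with a few lines of code. This is a sketch of the formula as stated in the text; the function and parameter names are assumptions.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Estimate dies per wafer: wafer area over die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    # Circumference divided by the diagonal of a square die approximates
    # the number of partial dies lost along the wafer edge.
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


large = dies_per_wafer(30, 2.25)  # die 1.5 cm on a side
small = dies_per_wafer(30, 1.00)  # die 1.0 cm on a side
print(round(large), round(small))
```

For the 2.25 cm² die this gives about 270 dies; for a 1.00 cm² die on the same wafer it gives about 640.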


However, this only gives the maximum number of dies per wafer. The critical
question is: What is the fraction of good dies on a wafer, or the die yield?
A simple model of integrated circuit yield, which assumes that defects are ran- 
domly distributed over the wafer and that yield is inversely proportional to the 
complexity of the fabrication process, leads to the following: 


$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$

Figure 1.12 This 300mm wafer contains 117 AMD Opteron chips implemented in a 90 nm process. (Courtesy
AMD.)


The formula is an empirical model developed by looking at the yield of many 
manufacturing lines. Wafer yield accounts for wafers that are completely bad and 
so need not be tested. For simplicity, we'll just assume the wafer yield is 100%. 
Defects per unit area is a measure of the random manufacturing defects that 
occur. In 2006, this value is typically 0.4 defects per square centimeter for
90 nm, as it depends on the maturity of the process (recall the learning curve,
mentioned earlier). Lastly, α is a parameter that corresponds roughly to the
number of critical masking levels, a measure of manufacturing complexity. For
multilevel metal CMOS processes in 2006, a good estimate is α = 4.0.
Example

Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming
a defect density of 0.4 per cm² and α is 4.

Answer

The total die areas are 2.25 cm² and 1.00 cm². For the larger die, the yield is

$$\text{Die yield} = \left(1 + \frac{0.4 \times 2.25}{4.0}\right)^{-4} = 0.44$$

For the smaller die, it is

$$\text{Die yield} = \left(1 + \frac{0.4 \times 1.00}{4.0}\right)^{-4} = 0.68$$


That is, less than half of all the large dies are good but more than two-thirds of the
small dies are good.
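
The two yields can be recomputed directly from the empirical yield model. The sketch below assumes a wafer yield of 100%, as the text does; the function and parameter names are illustrative.

```python
def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    """Empirical die yield: wafer yield x (1 + defects x area / alpha) ** -alpha."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


large_yield = die_yield(0.4, 2.25)  # die 1.5 cm on a side
small_yield = die_yield(0.4, 1.00)  # die 1.0 cm on a side
print(f"{large_yield:.2f} {small_yield:.2f}")
```

Because the die area sits inside a term raised to the power of negative α, yield falls off faster than linearly as the die grows.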


The bottom line is the number of good dies per wafer, which comes from 
multiplying dies per wafer by die yield to incorporate the effects of defects. The 
examples above predict about 120 good 2.25 cm² dies from the 300 mm wafer
and 435 good 1.00 cm² dies. Many 32-bit and 64-bit microprocessors in a modern
90 nm technology fall between these two sizes. Low-end embedded 32-bit
processors are sometimes as small as 0.25 cm², and processors used for embedded
control (in printers, automobiles, etc.) are often less than 0.1 cm².

Given the tremendous price pressures on commodity products such as 
DRAM and SRAM, designers have included redundancy as a way to raise yield. 
For a number of years, DRAMs have regularly included some redundant memory 
cells, so that a certain number of flaws can be accommodated. Designers have 
used similar techniques in both standard SRAMs and in large SRAM arrays used 
for caches within microprocessors. Obviously, the presence of redundant entries 
can be used to boost the yield significantly. 

Processing of a 300 mm (12-inch) diameter wafer in a leading-edge technology
costs between $5000 and $6000 in 2006. Assuming a processed wafer cost of
$5500, the cost of the 1.00 cm² die would be around $13, but the cost per die of
the 2.25 cm² die would be about $46, or almost four times the cost for a die that
is a little over twice as large.
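
Combining the dies-per-wafer estimate, the yield model, and the wafer cost gives the cost per good die. The sketch below chains the three formulas from this section; the function name is hypothetical, a wafer yield of 100% is assumed, and intermediate values are not rounded, so the count of good dies can differ slightly from the rounded figures in the text.

```python
import math


def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2,
                 defects_per_cm2=0.4, alpha=4.0):
    """Cost of one good die: wafer cost / (dies per wafer x die yield)."""
    dies = (math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))
    good_fraction = (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)
    return wafer_cost / (dies * good_fraction)


small_cost = cost_per_die(5500, 30, 1.00)
large_cost = cost_per_die(5500, 30, 2.25)
print(f"1.00 cm^2 die: ${small_cost:.0f}")
print(f"2.25 cm^2 die: ${large_cost:.0f}")
print(f"cost ratio: {large_cost / small_cost:.1f}x for {2.25 / 1.00:.2f}x the area")
```

The larger die costs between three and four times as much as the smaller one, even though its area is only 2.25 times greater.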

What should a computer designer remember about chip costs? The manufac- 
turing process dictates the wafer cost, wafer yield, and defects per unit area, so 
the sole control of the designer is die area. In practice, because the number of 
defects per unit area is small, the number of good dies per wafer, and hence the 
cost per die, grows roughly as the square of the die area. The computer designer 
affects die size, and hence cost, both by what functions are included on or 
excluded from the die and by the number of I/O pins. 

Before we have a part that is ready for use in a computer, the die must be 
tested (to separate the good dies from the bad), packaged, and tested again after 
packaging. These steps all add significant costs. 

The above analysis has focused on the variable costs of producing a func- 
tional die, which is appropriate for high-volume integrated circuits. There is, 
however, one very important part of the fixed cost that can significantly affect the 
cost of an integrated circuit for low volumes (less than 1 million parts), namely,
the cost of a mask set. Each step in the integrated circuit process requires a sepa- 
rate mask. Thus, for modern high-density fabrication processes with four to six 
metal layers, mask costs exceed $1 million. Obviously, this large fixed cost 
affects the cost of prototyping and debugging runs and, for small-volume produc- 
tion, can be a significant part of the production cost. Since mask costs are likely 
to continue to increase, designers may incorporate reconfigurable logic to 
enhance the flexibility of a part, or choose to use gate arrays (which have fewer 
custom mask levels) and thus reduce the cost implications of masks. 


Cost versus Price 


With the commoditization of computers, the margin between the cost to
manufacture a product and the price the product sells for has been shrinking.
Those margins pay for a company's research and development (R&D), market- 
ing, sales, manufacturing equipment maintenance, building rental, cost of financ- 
ing, pretax profits, and taxes. Many engineers are surprised to find that most 
companies spend only 4% (in the commodity PC business) to 12% (in the high- 
end server business) of their income on R&D, which includes all engineering. 


Dependability 


Historically, integrated circuits were one of the most reliable components of a 
computer. Although their pins may be vulnerable, and faults may occur over 
communication channels, the error rate inside the chip was very low. That con- 
ventional wisdom is changing as we head to feature sizes of 65 nm and smaller, 
as both transient faults and permanent faults will become more commonplace, so 
architects must design systems to cope with these challenges. This section gives 
a quick overview of the issues in dependability, leaving the official definition of
the terms and approaches to Section 6.3. 

Computers are designed and constructed at different layers of abstraction. We 
can descend recursively down through a computer seeing components enlarge 
themselves to full subsystems until we run into individual transistors. Although 
some faults are widespread, like the loss of power, many can be limited to a sin- 
gle component in a module. Thus, utter failure of a module at one level may be 
considered merely a component error in a higher-level module. This distinction is 
helpful in trying to find ways to build dependable computers. 

One difficult question is deciding when a system is operating properly. This 
philosophical point became concrete with the popularity of Internet services. 
Infrastructure providers started offering Service Level Agreements (SLA) or 
Service Level Objectives (SLO) to guarantee that their networking or power service
would be dependable. For example, they would pay the customer a penalty if they
failed to meet an agreement for more than some number of hours per month. Thus, an SLA
could be used to decide whether the system was up or down. 


Systems alternate between two states of service with respect to an SLA: 


1. Service accomplishment, where the service is delivered as specified 


2. Service interruption, where the delivered service is different from the SLA 


Transitions between these two states are caused by failures (from state 1 to state 
2) or restorations (2 to 1). Quantifying these transitions leads to the two main 
measures of dependability: 


Module reliability is a measure of the continuous service accomplishment (or, 
equivalently, of the time to failure) from a reference initial instant. Hence, the 
mean time to failure (MTTF) is a reliability measure. The reciprocal of 
MTTF is a rate of failures, generally reported as failures per billion hours of 
operation, or FIT (for failures in time). Thus, an MTTF of 1,000,000 hours
equals 10^9/10^6 or 1000 FIT. Service interruption is measured as mean time to
repair (MTTR). Mean time between failures (MTBF) is simply the sum of 
MTTF + MTTR. Although MTBF is widely used, MTTF is often the more 
appropriate term. If a collection of modules have exponentially distributed 
lifetimes—meaning that the age of a module is not important in probability of 
failure—the overall failure rate of the collection is the sum of the failure rates 
of the modules. 


Module availability is a measure of the service accomplishment with respect 
to the alternation between the two states of accomplishment and interruption. 
For nonredundant systems with repair, module availability is 


$$\text{Module availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$





Note that reliability and availability are now quantifiable metrics, rather than syn- 
onyms for dependability. From these definitions, we can estimate reliability of a 
system quantitatively if we make some assumptions about the reliability of com- 
ponents and that failures are independent. 
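
The two measures can be computed directly from MTTF and MTTR. The following Python sketch is illustrative only: the function names are hypothetical, and the 24-hour MTTR is an assumed value chosen for the example.

```python
HOURS_PER_BILLION = 1_000_000_000


def mttf_to_fit(mttf_hours: float) -> float:
    """Failures per billion hours of operation (FIT) for a given MTTF."""
    return HOURS_PER_BILLION / mttf_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability of a nonredundant system with repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


fit = mttf_to_fit(1_000_000)          # MTTF of 1,000,000 hours
avail = availability(1_000_000, 24)   # assumed 24-hour MTTR
print(f"{fit:.0f} FIT, availability {avail:.6f}")
```

Under these assumptions, a module with a 1,000,000-hour MTTF corresponds to 1000 FIT and is available roughly 99.998% of the time.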


Example

Assume a disk subsystem with the following components and MTTF:

10 disks, each rated at 1,000,000-hour MTTF
1 SCSI controller, 500,000-hour MTTF
1 power supply, 200,000-hour MTTF
1 fan, 200,000-hour MTTF
1 SCSI cable, 1,000,000-hour MTTF


Using the simplifying assumptions that the lifetimes are exponentially distributed 
and that failures are independent, compute the MTTF of the system as a whole. 


Answer 


The sum of the failure rates is

$$\text{Failure rate}_{\text{system}} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} + \frac{1}{200{,}000} + \frac{1}{1{,}000{,}000}$$

$$= \frac{10 + 2 + 5 + 5 + 1}{1{,}000{,}000 \text{ hours}} = \frac{23}{1{,}000{,}000} = \frac{23{,}000}{1{,}000{,}000{,}000 \text{ hours}}$$

or 23,000 FIT. The MTTF for the system is just the inverse of the failure rate: 





$$\text{MTTF}_{\text{system}} = \frac{1}{\text{Failure rate}_{\text{system}}} = \frac{1{,}000{,}000{,}000 \text{ hours}}{23{,}000} = 43{,}500 \text{ hours}$$





or just under 5 years. 
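
The same arithmetic can be checked with a short script. This is a minimal sketch using the component counts and MTTFs from the example; the dictionary layout and variable names are illustrative choices, not part of the original text.

```python
# (count, MTTF in hours) for each component type in the disk subsystem
components = {
    "disk": (10, 1_000_000),
    "SCSI controller": (1, 500_000),
    "power supply": (1, 200_000),
    "fan": (1, 200_000),
    "SCSI cable": (1, 1_000_000),
}

# With independent, exponentially distributed lifetimes, failure rates add.
failure_rate = sum(count / mttf for count, mttf in components.values())

fit = failure_rate * 1e9                 # failures per billion hours
system_mttf_hours = 1 / failure_rate     # inverse of the failure rate
system_mttf_years = system_mttf_hours / (24 * 365)

print(f"{fit:.0f} FIT, MTTF {system_mttf_hours:.0f} hours "
      f"({system_mttf_years:.2f} years)")
```

The unrounded MTTF is a little under the 43,500 hours quoted above, which is a rounded figure.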


The primary way to cope with failure is redundancy, either in time (repeat the 
operation to see if it still is erroneous) or in resources (have other components to 
take over from the one that failed). Once the component is replaced and the sys- 
tem fully repaired, the dependability of the system is assumed to be as good as 
new. Let's quantify the benefits of redundancy with an example. 


Example

Disk subsystems often have redundant power supplies to improve dependability.
Using the components and MTTFs from above, calculate the reliability of a
redundant power supply. Assume one power supply is sufficient to run the disk
subsystem and that we are adding one redundant power supply.

Answer

We need a formula to show what to expect when we can tolerate a failure and still
provide service. To simplify the calculations, we assume that the lifetimes of the 
components are exponentially distributed and that there is no dependency 
between the component failures. MTTF for our redundant power supplies is the 
mean time until one power supply fails divided by the chance that the other will 
fail before the first one is replaced. Thus, if the chance of a second failure before 
repair is small, then MTTF of the pair is large. 

Since we have two power supplies and independent failures, the mean time
until one power supply fails is $\text{MTTF}_{\text{power supply}} / 2$. A good approximation of the
probability of a second failure is MTTR over the mean time until the other power
supply fails. Hence, a reasonable approximation for a redundant pair of power
supplies is
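
$$\text{MTTF}_{\text{power supply pair}} = \frac{\text{MTTF}_{\text{power supply}} / 2}{\dfrac{\text{MTTR}_{\text{power supply}}}{\text{MTTF}_{\text{power supply}}}} = \frac{\text{MTTF}_{\text{power supply}}^2 / 2}{\text{MTTR}_{\text{power supply}}} = \frac{\text{MTTF}_{\text{power supply}}^2}{2 \times \text{MTTR}_{\text{power supply}}}$$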


Using the MTTF numbers above, if we assume it takes on average 24 hours for a 
human operator to notice that a power supply has failed and replace it, the reli- 
ability of the fault tolerant pair of power supplies is 

$$\text{MTTF}_{\text{power supply pair}} = \frac{\text{MTTF}_{\text{power supply}}^2}{2 \times \text{MTTR}_{\text{power supply}}} = \frac{200{,}000^2}{2 \times 24} \cong 830{,}000{,}000$$


making the pair about 4150 times more reliable than a single power supply. 
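
The approximation for a redundant pair can be sketched in a few lines of Python. The function name is hypothetical, and the code follows the reasoning above step by step (mean time to the first failure, divided by the chance of a second failure during repair) rather than any library routine.

```python
def mttf_redundant_pair(mttf_single: float, mttr: float) -> float:
    """Approximate MTTF of two redundant modules, either of which suffices."""
    mean_time_to_first_failure = mttf_single / 2
    prob_second_failure_during_repair = mttr / mttf_single
    return mean_time_to_first_failure / prob_second_failure_during_repair


pair_mttf = mttf_redundant_pair(mttf_single=200_000, mttr=24)
improvement = pair_mttf / 200_000

print(f"pair MTTF ~ {pair_mttf:,.0f} hours, {improvement:,.0f}x a single supply")
```

Without intermediate rounding the result is about 833 million hours, or roughly 4167 times a single supply; the text rounds these to 830,000,000 and 4150.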
Having quantified the cost, power, and dependability of computer technology, we
are ready to quantify performance. 


1.8 Measuring, Reporting, and Summarizing Performance 


When we say one computer is faster than another is, what do we mean? The user 
of a desktop computer may say a computer is faster when a program runs in less 
time, while an Amazon.com administrator may say a computer is faster when it 
completes more transactions per hour. The computer user is interested in reduc- 
ing response time—the time between the start and the completion of an event— 
also referred to as execution time. The administrator of a large data processing 
center may be interested in increasing throughput—the total amount of work 
done in a given time. 

In comparing design alternatives, we often want to relate the performance of 
two different computers, say, X and Y. The phrase "X is faster than Y" is used 
here to mean that the response time or execution time is lower on X than on Y for 
the given task. In particular, "X is n times faster than Y" will mean 


$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$


Since execution time is the reciprocal of performance, the following relationship 
holds: 


$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1 / \text{Performance}_Y}{1 / \text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$


The phrase "the throughput of X is 1.3 times higher than Y" signifies here
that the number of tasks completed per unit time on computer X is 1.3 times the
number completed on Y.
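
As a small illustration of the relationship between execution time and performance, the sketch below uses made-up execution times for two computers X and Y; the numbers and the function name are assumptions for the example only.

```python
def times_faster(exec_time_y: float, exec_time_x: float) -> float:
    """Return n such that X is n times faster than Y."""
    return exec_time_y / exec_time_x


exec_time_x = 10.0   # seconds on X (hypothetical)
exec_time_y = 15.0   # seconds on Y (hypothetical)

n = times_faster(exec_time_y, exec_time_x)

# Performance is the reciprocal of execution time, so the same n
# appears as the ratio of performances with X in the numerator.
perf_x = 1 / exec_time_x
perf_y = 1 / exec_time_y
perf_ratio = perf_x / perf_y

print(n, perf_ratio)
```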

Unfortunately, time is not always the metric quoted in comparing the perfor- 
mance of computers. Our position is that the only consistent and reliable measure 
of performance is the execution time of real programs, and that all proposed 
alternatives to time as the metric or to real programs as the items measured have 
eventually led to misleading claims or even mistakes in computer design. 

Even execution time can be defined in different ways depending on what we 
count. The most straightforward definition of time is called wall-clock time, 
response time, or elapsed time, which is the latency to complete a task, including 
disk accesses, memory accesses, input/output activities, operating system over- 
head—everything. With multiprogramming, the processor works on another pro- 
gram while waiting for I/O and may not necessarily minimize the elapsed time of 
one program. Hence, we need a term to consider this activity. CPU time recognizes
this distinction and means the time the processor is computing, not including
the time waiting for I/O or running other programs. (Clearly, the response
time seen by the user is the elapsed time of the program, not the CPU time.)
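
The difference between elapsed time and CPU time can be observed with Python's standard `time` module. This is a rough sketch, not a benchmarking methodology: `time.perf_counter()` reads a wall-clock timer, `time.process_time()` reports CPU time of the current process, and the `time.sleep` call is only a stand-in for waiting on I/O.

```python
import time


def measure(task):
    """Return (elapsed wall-clock seconds, CPU seconds) for one call of task."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def io_like_task():
    time.sleep(0.2)                       # stands in for waiting on I/O
    sum(i * i for i in range(100_000))    # a little real computation


wall, cpu = measure(io_like_task)
print(f"elapsed {wall:.3f} s, CPU {cpu:.3f} s")
```

The elapsed time includes the 0.2-second wait, while the CPU time counts only the computation, so the two figures differ noticeably.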

Computer users who routinely run the same programs would be the perfect 
candidates to evaluate a new computer. To evaluate a new system the users would 
simply compare the execution time of their workloads—the mixture of programs 
and operating system commands that users run on a computer. Few are in this 
happy situation, however. Most must rely on other methods to evaluate comput- 
ers, and often other evaluators, hoping that these methods will predict per- 
formance for their usage of the new computer. 


Benchmarks 


The best choice of benchmarks to measure performance is real applications,
such as a compiler. Attempts at running programs that are much simpler than a 
real application have led to performance pitfalls. Examples include 


• kernels, which are small, key pieces of real applications;

• toy programs, which are 100-line programs from beginning programming
assignments, such as quicksort; and

• synthetic benchmarks, which are fake programs invented to try to match the
profile and behavior of real applications, such as Dhrystone.


All three are discredited today, usually because the compiler writer and architect 
can conspire to make the computer appear faster on these stand-in programs than 
on real applications. 

Another issue is the conditions under which the benchmarks are run. One 
way to improve the performance of a benchmark has been with benchmark- 
specific flags; these flags often caused transformations that would be illegal on 
many programs or would slow down performance on others. To restrict this pro- 
cess and increase the significance of the results, benchmark developers often 
require the vendor to use one compiler and one set of flags for all the programs in 
the same language (C or FORTRAN). In addition to the question of compiler 
flags, another question is whether source code modifications are allowed. There 
are three different approaches to addressing this question: 


1. No source code modifications are allowed. 


2. Source code modifications are allowed, but are essentially impossible. For 
example, database benchmarks rely on standard database programs that are 
tens of millions of lines of code. The database companies are highly unlikely 
to make changes to enhance the performance for one particular computer. 


3. Source modifications are allowed, as long as the modified version produces
the same output.


The key issue that benchmark designers face in deciding to allow modification of 
the source is whether such modifications will reflect real practice and provide 
useful insight to users, or whether such modifications simply reduce the accuracy
of the benchmarks as predictors of real performance. 

To overcome the danger of placing too many eggs in one basket, collections 
of benchmark applications, called benchmark suites, are a popular measure of 
performance of processors with a variety of applications. Of course, such suites 
are only as good as the constituent individual benchmarks. Nonetheless, a key 
advantage of such suites is that the weakness of any one benchmark is lessened 
by the presence of the other benchmarks. The goal of a benchmark suite is that it 
will characterize the relative performance of two computers, particularly for pro- 
grams not in the suite that customers are likely to run. 

As a cautionary example, the EDN Embedded Microprocessor Benchmark 
Consortium (or EEMBC, pronounced "embassy") is a set of 41 kernels used to 
predict performance of different embedded applications: automotive/industrial, 
consumer, networking, office automation, and telecommunications. EEMBC 
reports unmodified performance and "full fury" performance, where almost any- 
thing goes. Because they use kernels, and because of the reporting options, 
EEMBC does not have the reputation of being a good predictor of relative perfor- 
mance of different embedded computers in the field. The synthetic program 
Dhrystone, which EEMBC was trying to replace, is still reported in some embed- 
ded circles. 

One of the most successful attempts to create standardized benchmark appli- 
cation suites has been the SPEC (Standard Performance Evaluation Corporation), 
which had its roots in the late 1980s efforts to deliver better benchmarks for 
workstations. Just as the computer industry has evolved over time, so has the 
need for different benchmark suites, and there are now SPEC benchmarks to 
cover different application classes. All the SPEC benchmark suites and their 
reported results are found at www.spec.org. 

Although we focus our discussion on the SPEC benchmarks in many of the 
following sections, there are also many benchmarks developed for PCs running 
the Windows operating system. 


Desktop Benchmarks 


Desktop benchmarks divide into two broad classes: processor-intensive bench- 
marks and graphics-intensive benchmarks, although many graphics benchmarks 
include intensive processor activity. SPEC originally created a benchmark set 
focusing on processor performance (initially called SPEC89), which has evolved 
into its fifth generation: SPEC CPU2006, which follows SPEC2000, SPEC95,
SPEC92, and SPEC89. SPEC CPU2006 consists of a set of 12 integer bench- 
marks (CINT2006) and 17 floating-point benchmarks (CFP2006). Figure 1.13 
describes the current SPEC benchmarks and their ancestry. 

SPEC benchmarks are real programs modified to be portable and to minimize
the effect of I/O on performance. The integer benchmarks vary from part of a C
compiler to a chess program to a quantum computer simulation. The
floating-point benchmarks include structured grid codes for finite element modeling,
particle method codes for molecular dynamics, and sparse linear algebra codes for
fluid dynamics. The SPEC CPU suite is useful for processor benchmarking for
both desktop systems and single-processor servers. We will see data on many of
these programs throughout this text.


SPEC2006 benchmark description              SPEC2006 benchmark name

Integer programs
GNU C compiler                              gcc
Interpreted string processing               perl
Combinatorial optimization                  mcf
Block-sorting compression                   bzip2
Go game (AI)                                go
Video compression                           h264avc
Games/path finding                          astar
Search gene sequence                        hmmer
Quantum computer simulation                 libquantum
Discrete event simulation library           omnetpp
Chess game (AI)                             sjeng
XML parsing                                 xalancbmk

Floating-point programs
CFD/blast waves                             bwaves
Numerical relativity                        cactusADM
Finite element code                         calculix
Differential equation solver framework      dealII
Quantum chemistry                           gamess
EM solver (freq/time domain)                GemsFDTD
Scalable molecular dynamics (~NAMD)         gromacs
Lattice Boltzmann method (fluid/air flow)   lbm
Large eddy simulation/turbulent CFD         LESlie3d
Lattice quantum chromodynamics              milc
Molecular dynamics                          namd
Image ray tracing                           povray
Sparse linear algebra                       soplex
Speech recognition                          sphinx3
Quantum chemistry/object oriented           tonto
Weather research and forecasting            wrf
Magneto hydrodynamics (astrophysics)        zeusmp

Program names appearing in the earlier-generation columns (SPEC2000, SPEC95,
SPEC92, SPEC89). Integer: perl, espresso, li, compress, eqntott, vortex, go, sc,
gzip, ijpeg, eon, m88ksim, twolf, vpr, crafty, parser. Floating point: fpppp,
tomcatv, doduc, nasa7, spice, swim, matrix300, apsi, hydro2d, mgrid, su2cor,
wupwise, applu, wave5, turb3d, galgel, mesa, art, equake, facerec, ammp, lucas,
fma3d, sixtrack.


Figure 1.13 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs
above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in
C, and the rest in C++. For the floating-point programs the split is 6 in FORTRAN, 4 in C++, 3 in C, and 4 in mixed C
and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The
benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier ones. Programs in the same row from
different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the
senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more
generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from
generation to generation, the version of the program changes and either the input or the size of the benchmark is often
changed to increase its running time and to avoid perturbation in measurement or domination of the execution
time by some factor other than CPU time.
In Section 1.11, we describe pitfalls that have occurred in developing the
SPEC benchmark suite, as well as the challenges in maintaining a useful and pre- 
dictive benchmark suite. Although SPEC CPU2006 is aimed at processor perfor- 
mance, SPEC also has benchmarks for graphics and Java. 


Server Benchmarks 


Just as servers have multiple functions, so there are multiple types of bench- 
marks. The simplest benchmark is perhaps a processor throughput-oriented 
benchmark. SPEC CPU2000 uses the SPEC CPU benchmarks to construct a sim- 
ple throughput benchmark where the processing rate of a multiprocessor can be 
measured by running multiple copies (usually as many as there are processors) of 
each SPEC CPU benchmark and converting the CPU time into a rate. This leads 
to a measurement called the SPECrate. 

Other than SPECrate, most server applications and benchmarks have signifi- 
cant I/O activity arising from either disk or network traffic, including benchmarks 
for file server systems, for Web servers, and for database and transaction- 
processing systems. SPEC offers both a file server benchmark (SPECSFS) and a 
Web server benchmark (SPECWeb). SPECSFS is a benchmark for measuring 
NFS (Network File System) performance using a script of file server requests; it 
tests the performance of the I/O system (both disk and network I/O) as well as the 
processor. SPECSFS is a throughput-oriented benchmark but with important 
response time requirements. (Chapter 6 discusses some file and I/O system 
benchmarks in detail.) SPECWeb is a Web server benchmark that simulates mul- 
tiple clients requesting both static and dynamic pages from a server, as well as 
clients posting data to the server. 

Transaction-processing (TP) benchmarks measure the ability of a system to 
handle transactions, which consist of database accesses and updates. Airline res- 
ervation systems and bank ATM systems are typical simple examples of TP; 
more sophisticated TP systems involve complex databases and decision-making. 
In the mid-1980s, a group of concerned engineers formed the vendor-indepen- 
dent Transaction Processing Council (TPC) to try to create realistic and fair 
benchmarks for TP. The TPC benchmarks are described at www.tpc.org. 

The first TPC benchmark, TPC-A, was published in 1985 and has since been 
replaced and enhanced by several different benchmarks. TPC-C, initially created 
in 1992, simulates a complex query environment. TPC-H models ad hoc decision 
support—the queries are unrelated and knowledge of past queries cannot be used 
to optimize future queries. TPC-W is a transactional Web benchmark. The work- 
load is performed in a controlled Internet commerce environment that simulates 
the activities of a business-oriented transactional Web server. The most recent is 
TPC-App, an application server and Web services benchmark. The workload 
simulates the activities of a business-to-business transactional application server 
operating in a 24x7 environment. 

All the TPC benchmarks measure performance in transactions per second. In 
addition, they include a response time requirement, so that throughput performance
is measured only when the response time limit is met. To model real-world
systems, higher transaction rates are also associated with larger systems, in
terms of both users and the database to which the transactions are applied. 
Finally, the system cost for a benchmark system must also be included, allowing 
accurate comparisons of cost-performance. 


Reporting Performance Results 


The guiding principle of reporting performance measurements should be repro- 
ducibility—list everything another experimenter would need to duplicate the 
results. A SPEC benchmark report requires an extensive description of the com- 
puter and the compiler flags, as well as the publication of both the baseline and 
optimized results. In addition to hardware, software, and baseline tuning parame- 
ter descriptions, a SPEC report contains the actual performance times, shown 
both in tabular form and as a graph. A TPC benchmark report is even more com- 
plete, since it must include results of a benchmarking audit and cost information. 
These reports are excellent sources for finding the real cost of computing sys- 
tems, since manufacturers compete on high performance and cost-performance. 


Summarizing Performance Results 


In practical computer design, you must evaluate myriad design choices for
their relative quantitative benefits across a suite of benchmarks believed to be rel- 
evant. Likewise, consumers trying to choose a computer will rely on performance 
measurements from benchmarks, which hopefully are similar to the user's appli- 
cations. In both cases, it is useful to have measurements for a suite of benchmarks 
so that the performance of important applications is similar to that of one or more 
benchmarks in the suite and that variability in performance can be understood. In 
the ideal case, the suite resembles a statistically valid sample of the application 
space, but such a sample requires more benchmarks than are typically found in 
most suites and requires a randomized sampling, which essentially no benchmark 
suite uses. 

Once we have chosen to measure performance with a benchmark suite, we 
would like to be able to summarize the performance results of the suite in a single 
number. A straightforward approach to computing a summary result would be to 
compare the arithmetic means of the execution times of the programs in the suite. 
Alas, some SPEC programs take four times longer than others, so those programs 
would be much more important if the arithmetic mean were the single number 
used to summarize performance. An alternative would be to add a weighting fac- 
tor to each benchmark and use the weighted arithmetic mean as the single num- 
ber to summarize performance. The problem would be then how to pick weights; 
since SPEC is a consortium of competing companies, each company might have 
its own favorite set of weights, which would make it hard to reach consensus.
One approach is to use weights that make all programs execute an equal time on 
some reference computer, but this biases the results to the performance character- 
istics of the reference computer. 
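
To make the weighting idea concrete, here is a minimal sketch with two hypothetical programs. The execution times and function names are invented for illustration; the weights are chosen proportional to the reciprocal of each program's time on the reference computer, so every program contributes equally there.

```python
def equal_time_weights(ref_times):
    """Weights (summing to 1) that give every program equal weighted time
    on the reference computer."""
    inverses = [1.0 / t for t in ref_times]
    total = sum(inverses)
    return [inv / total for inv in inverses]


def weighted_arithmetic_mean(weights, times):
    return sum(w * t for w, t in zip(weights, times))


ref_times = [100.0, 400.0]     # seconds on the reference computer (hypothetical)
other_times = [50.0, 400.0]    # seconds on the computer being rated (hypothetical)

weights = equal_time_weights(ref_times)
contributions = [w * t for w, t in zip(weights, ref_times)]

wam_ref = weighted_arithmetic_mean(weights, ref_times)
wam_other = weighted_arithmetic_mean(weights, other_times)
print(weights, contributions, wam_ref, wam_other)
```

Because the weights depend entirely on the reference computer's times, a different reference computer would produce different weights and a different summary, which is the bias described above.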

Rather than pick weights, we could normalize execution times to a reference 
computer by dividing the time on the reference computer by the time on the com- 
puter being rated, yielding a ratio proportional to performance. SPEC uses this 
approach, calling the ratio the SPECRatio. It has a particularly useful property 
that it matches the way we compare computer performance throughout this 
text—namely, comparing performance ratios. For example, suppose that the 
SPECRatio of computer A on a benchmark was 1.25 times higher than computer 
B; then you would know 

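$$1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{Execution time}_{\text{reference}} / \text{Execution time}_A}{\text{Execution time}_{\text{reference}} / \text{Execution time}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{\text{Performance}_A}{\text{Performance}_B}$$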


Notice that the execution times on the reference computer drop out and the 
choice of the reference computer is irrelevant when the comparisons are made as 
a ratio, which is the approach we consistently use. Figure 1.14 gives an example.
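
A quick numerical check shows the reference time cancelling. The execution times below are hypothetical, and `spec_ratio` is simply a name for the normalization described above (reference time divided by measured time).

```python
def spec_ratio(ref_time: float, measured_time: float) -> float:
    """Reference execution time divided by the measured execution time."""
    return ref_time / measured_time


time_a = 40.0   # seconds on computer A (hypothetical)
time_b = 50.0   # seconds on computer B (hypothetical)

# Try several arbitrary reference computers; the A-to-B comparison
# should not depend on which one is used.
comparisons = [
    spec_ratio(ref, time_a) / spec_ratio(ref, time_b)
    for ref in (1000.0, 2500.0, 77.0)
]
print(comparisons)
```

Each comparison equals the ratio of the execution time on B to that on A, up to floating-point rounding.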
Because a SPECRatio is a ratio rather than an absolute execution time, the 
mean must be computed using the geometric mean. (Since SPECRatios have no 
units, comparing SPECRatios arithmetically is meaningless.) The formula is 


$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{sample}_i}$$

In the case of SPEC, sample_i is the SPECRatio for program i. Using the geometric
mean ensures two important properties:


1. The geometric mean of the ratios is the same as the ratio of the geometric 
means. 


2. The ratio of the geometric means is equal to the geometric mean of the per- 
formance ratios, which implies that the choice of the reference computer is 
irrelevant. 


Hence, the motivations to use the geometric mean are substantial, especially 
when we use performance ratios to make comparisons. 


Example

Show that the ratio of the geometric means is equal to the geometric mean of the
performance ratios, and that the choice of reference computer for SPECRatio does not matter.
                 Ultra 5      Opteron                  Itanium 2                Opteron/Itanium   Itanium/Opteron
Benchmarks       Time (sec)   Time (sec)   SPECRatio   Time (sec)   SPECRatio   Times (sec)       SPECRatios
wupwise          1600         51.5         31.06       56.1         28.53       0.92              0.92
swim             3100         125.0        24.73       70.7         43.85       1.77              1.77
mgrid            1800         98.0         18.37       65.8         27.36       1.49              1.49
applu            2100         94.0         22.34       50.9         41.25       1.85              1.85
mesa             1400         64.6         21.69       108.0        12.99       0.60              0.60
galgel           2900         86.4         33.57       40.0         72.47       2.16              2.16
art              2600         92.4         28.13       21.0         123.67      4.40              4.40
equake           1300         72.6         17.92       36.3         35.78       2.00              2.00
facerec          1900         73.6         25.80       86.9         21.86       0.85              0.85
ammp             2200         136.0        16.14       132.0        16.63       1.03              1.03
lucas            2000         88.8         22.52       107.0        18.76       0.83              0.83
fma3d            2100         120.0        17.48       131.0        16.09       0.92              0.92
sixtrack         1100         123.0        8.95        68.8         15.99       1.79              1.79
apsi             2600         150.0        17.36       231.0        11.27       0.65              0.65
Geometric mean                             20.86                    27.12       1.30              1.30





Figure 1.14 SPECfp2000 execution times (in seconds) for the Sun Ultra 5, the reference computer of
SPEC2000, and execution times and SPECRatios for the AMD Opteron and Intel Itanium 2. (SPEC2000 multiplies
the ratio of execution times by 100 to remove the decimal point from the result, so 20.86 is reported as 2086.) The
final two columns show the ratios of execution times and SPECRatios. This figure demonstrates the irrelevance of the
reference computer in relative performance. The ratio of the execution times is identical to the ratio of the SPECRatios,
and the ratio of the geometric means (27.12/20.86 = 1.30) is identical to the geometric mean of the ratios (1.30).


Answer 


Assume two computers A and B and a set of SPECRatios for each. 


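$$\frac{\text{Geometric mean}_A}{\text{Geometric mean}_B} = \frac{\sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio } A_i}}{\sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio } B_i}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{SPECRatio } A_i}{\text{SPECRatio } B_i}}$$

$$= \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Execution time}_{\text{reference}_i} / \text{Execution time}_{A_i}}{\text{Execution time}_{\text{reference}_i} / \text{Execution time}_{B_i}}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Execution time}_{B_i}}{\text{Execution time}_{A_i}}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Performance}_{A_i}}{\text{Performance}_{B_i}}}$$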


That is, the ratio of the geometric means of the SPECRatios of A and B is the 
geometric mean of the performance ratios of A to B of all the benchmarks in the 
suite. Figure 1.14 demonstrates the validity using examples from SPEC. 
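
The identity can also be checked numerically against the execution times in Figure 1.14. This Python sketch copies the three time columns from the figure; the helper name `geometric_mean` and the variable names are illustrative.

```python
import math

# Execution times in seconds, copied from Figure 1.14 (wupwise ... apsi).
ultra5 = [1600, 3100, 1800, 2100, 1400, 2900, 2600,
          1300, 1900, 2200, 2000, 2100, 1100, 2600]
opteron = [51.5, 125.0, 98.0, 94.0, 64.6, 86.4, 92.4,
           72.6, 73.6, 136.0, 88.8, 120.0, 123.0, 150.0]
itanium2 = [56.1, 70.7, 65.8, 50.9, 108.0, 40.0, 21.0,
            36.3, 86.9, 132.0, 107.0, 131.0, 68.8, 231.0]


def geometric_mean(samples):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# SPECRatio = reference time / measured time
ratios_opteron = [ref / t for ref, t in zip(ultra5, opteron)]
ratios_itanium2 = [ref / t for ref, t in zip(ultra5, itanium2)]

# Ratio of the geometric means of the SPECRatios ...
ratio_of_means = geometric_mean(ratios_itanium2) / geometric_mean(ratios_opteron)
# ... versus the geometric mean of the per-benchmark time ratios,
# which never touches the reference computer's times.
mean_of_ratios = geometric_mean([o / i for o, i in zip(opteron, itanium2)])

print(f"{ratio_of_means:.2f} {mean_of_ratios:.2f}")
```

The two quantities agree to within floating-point rounding, and both come out close to the 1.30 shown in the figure.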
A key question is whether a single mean summarizes the performance of the
programs in the benchmark suite well. If we characterize the variability of the 
distribution, using the standard deviation, we can decide whether the mean is 
likely to be a good predictor. The standard deviation is more informative if we 
know the distribution has one of several standard forms. 

One useful possibility is the well-known bell-shaped normal distribution, 
whose sample data are, of course, symmetric around the mean. Another is the 
lognormal distribution, where the logarithms of the data—not the data itself—are 
normally distributed on a logarithmic scale, and thus symmetric on that scale. 
(On a linear scale, a lognormal is not symmetric, but has a long tail to the right.) 

For example, if each of two systems is 10X faster than the other on two dif- 
ferent benchmarks, the relative performance is the set of ratios {.1, 10}. How- 
ever, the performance summary should be equal performance. That is, the 
average should be 1.0, which in fact is true on a logarithmic scale. 
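
The point is easy to verify. In this small sketch, the arithmetic mean of the two ratios suggests a large advantage, while the mean taken on a logarithmic scale (the geometric mean) reports equal performance.

```python
import math

ratios = [0.1, 10.0]   # each system is 10X faster on one of the two benchmarks

arithmetic_mean = sum(ratios) / len(ratios)

# Average the logarithms, then convert back with the exponential.
geometric_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(arithmetic_mean, geometric_mean)
```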

To characterize variability about the arithmetic mean, we use the arithmetic 
standard deviation (stdev), often called σ. It is defined as:





$$\text{stdev} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{sample}_i - \text{Mean})^2}$$


Like the geometric mean, the geometric standard deviation is multiplicative 
rather than additive. For working with the geometric mean and the geometric 
standard deviation, we can simply take the natural logarithm of the samples, 
compute the standard mean and standard deviation, and then take the exponential to
convert back. This insight allows us to describe the multiplicative versions of
mean and standard deviation (gstdev), also often called σ, as


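$$\text{Geometric mean} = \exp\left(\frac{1}{n} \times \sum_{i=1}^{n} \ln(\text{sample}_i)\right)$$

$$\text{gstdev} = \exp\left(\sqrt{\frac{\sum_{i=1}^{n} \left(\ln(\text{sample}_i) - \ln(\text{Geometric mean})\right)^2}{n}}\right)$$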





Note that functions provided in a modern spreadsheet program, like EXP() and 
LN(), make it easy to calculate the geometric mean and the geometric standard 
deviation. 

For a lognormal distribution, we expect that 68% of the samples fall in the
range [Mean / gstdev, Mean × gstdev], 95% within [Mean / gstdev², Mean ×
gstdev²], and so on.


Example

Using the data in Figure 1.14, calculate the geometric standard deviation and the
percentage of the results that fall within a single standard deviation of the
geometric mean. Are the results compatible with a lognormal distribution?

Answer

The geometric means are 20.86 for Opteron and 27.12 for Itanium 2. As you
might guess from the SPECRatios, the standard deviation for the Itanium 2 is 
much higher—1.93 versus 1.38—indicating that the results will differ more 
widely from the mean, and therefore are likely less predictable. The single stan- 
dard deviation range is [27.12/1.93, 27.12 x 1.93] or [14.06, 52.30] for Ita- 
nium 2 and [20.86/1.38, 20.86 x 1.38] or [15.12, 28.76] for Opteron. For 
Itanium 2, 10 of 14 benchmarks (71%) fall within one standard deviation; for 
Opteron, it is 11 of 14 (79%). Thus, both results are quite compatible with a
lognormal distribution. 
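
The answer can be reproduced with a few lines of Python that follow the logarithm-based definitions of the geometric mean and geometric standard deviation. The SPECRatios are copied from Figure 1.14; the function names are illustrative, and the variance divides by n, matching the gstdev definition used here.

```python
import math


def geometric_mean(samples):
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


def geometric_stdev(samples):
    log_mean = math.log(geometric_mean(samples))
    variance = sum((math.log(s) - log_mean) ** 2 for s in samples) / len(samples)
    return math.exp(math.sqrt(variance))


def count_within_one_gstdev(samples):
    """Number of samples inside [mean / gstdev, mean * gstdev]."""
    gm, gsd = geometric_mean(samples), geometric_stdev(samples)
    return sum(gm / gsd <= s <= gm * gsd for s in samples)


# SPECRatios from Figure 1.14 (wupwise ... apsi)
opteron = [31.06, 24.73, 18.37, 22.34, 21.69, 33.57, 28.13,
           17.92, 25.80, 16.14, 22.52, 17.48, 8.95, 17.36]
itanium2 = [28.53, 43.85, 27.36, 41.25, 12.99, 72.47, 123.67,
            35.78, 21.86, 16.63, 18.76, 16.09, 15.99, 11.27]

for name, ratios in (("Opteron", opteron), ("Itanium 2", itanium2)):
    print(name,
          round(geometric_mean(ratios), 2),
          round(geometric_stdev(ratios), 2),
          count_within_one_gstdev(ratios), "of", len(ratios))
```

Running this gives geometric standard deviations of about 1.38 and 1.93, with 11 and 10 of the 14 benchmarks inside the one-standard-deviation range, in line with the figures quoted in the answer.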


Quantitative Principles of Computer Design 


Now that we have seen how to define, measure, and summarize performance, 
cost, dependability, and power, we can explore guidelines and principles that are 
useful in the design and analysis of computers. This section introduces important 
observations about design, as well as two equations to evaluate alternatives. 


Take Advantage of Parallelism 


Taking advantage of parallelism is one of the most important methods for 
improving performance. Every chapter in this book has an example of how 
performance is enhanced through the exploitation of parallelism. We give three 
brief examples, which are expounded on in later chapters. 

Our first example is the use of parallelism at the system level. To improve the 
throughput performance on a typical server benchmark, such as SPECWeb or 
TPC-C, multiple processors and multiple disks can be used. The workload of 
handling requests can then be spread among the processors and disks, resulting in 
improved throughput. Being able to expand memory and the number of proces- 
sors and disks is called scalability, and it is a valuable asset for servers. 

At the level of an individual processor, taking advantage of parallelism 
among instructions is critical to achieving high performance. One of the simplest 
ways to do this is through pipelining. The basic idea behind pipelining, which is 
explained in more detail in Appendix A and is a major focus of Chapter 2, is to 
overlap instruction execution to reduce the total time to complete an instruction 
sequence. A key insight that allows pipelining to work is that not every instruc- 
tion depends on its immediate predecessor, and thus, executing the instructions 
completely or partially in parallel may be possible.

Parallelism can also be exploited at the level of detailed digital design. For
example, set-associative caches use multiple banks of memory that are typically 
searched in parallel to find a desired item. Modern ALUs use carry-lookahead, 
which uses parallelism to speed the process of computing sums from linear to 
logarithmic in the number of bits per operand. 


Principle of Locality 


Important fundamental observations have come from properties of programs. The 
most important program property that we regularly exploit is the principle of 
locality: Programs tend to reuse data and instructions they have used recently. A 
widely held rule of thumb is that a program spends 90% of its execution time in 
only 10% of the code. An implication of locality is that we can predict with rea- 
sonable accuracy what instructions and data a program will use in the near future 
based on its accesses in the recent past. The principle of locality also applies to 
data accesses, though not as strongly as to code accesses. 

Two different types of locality have been observed. Temporal locality states 
that recently accessed items are likely to be accessed in the near future. Spatial 
locality says that items whose addresses are near one another tend to be refer- 
enced close together in time. We will see these principles applied in Chapter 5. 


Focus on the Common Case 


Perhaps the most important and pervasive principle of computer design is to 
focus on the common case: In making a design trade-off, favor the frequent 
case over the infrequent case. This principle applies when determining how to 
spend resources, since the impact of the improvement is higher if the occur- 
rence is frequent. 

Focusing on the common case works for power as well as for resource alloca- 
tion and performance. The instruction fetch and decode unit of a processor may 
be used much more frequently than a multiplier, so optimize it first. It works on 
dependability as well. If a database server has 50 disks for every processor, as in 
the next section, storage dependability will dominate system dependability. 

In addition, the frequent case is often simpler and can be done faster than the 
infrequent case. For example, when adding two numbers in the processor, we can 
expect overflow to be a rare circumstance and can therefore improve performance 
by optimizing the more common case of no overflow. This may slow down the 
case when overflow occurs, but if that is rare, then overall performance will be 
improved by optimizing for the normal case. 

We will see many cases of this principle throughout this text. In applying this 
simple principle, we have to decide what the frequent case is and how much per- 
formance can be improved by making that case faster. A fundamental law, called 
Amdahl's Law, can be used to quantify this principle.


Amdahl's Law


The performance gain that can be obtained by improving some portion of a com- 
puter can be calculated using Amdahl's Law. Amdahl's Law states that the per- 
formance improvement to be gained from using some faster mode of execution is 
limited by the fraction of the time the faster mode can be used. 

Amdahl's Law defines the speedup that can be gained by using a particular 
feature. What is speedup? Suppose that we can make an enhancement to a com- 
puter that will improve performance when it is used. Speedup is the ratio 


$$
\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}
$$

Alternatively,

$$
\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}
$$

Speedup tells us how much faster a task will run using the computer with the
enhancement as opposed to the original computer.

Amdahl's Law gives us a quick way to find the speedup from some enhance- 
ment, which depends on two factors: 


1. The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement. For example, if 20
seconds of the execution time of a program that takes 60 seconds in total
can use an enhancement, the fraction is 20/60. This value, which we will call
Fraction_enhanced, is always less than or equal to 1.


2. The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire
program. This value is the time of the original mode over the time of the
enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of
the program, while it is 5 seconds in the original mode, the improvement is
5/2. We will call this value, which is always greater than 1, Speedup_enhanced.


The execution time using the original computer with the enhanced mode will be 
the time spent using the unenhanced portion of the computer plus the time spent 
using the enhancement: 

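$$
\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)
$$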


The overall speedup is the ratio of the execution times: 


$$
\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
$$


Suppose that we want to enhance the processor used for Web serving. The new
processor is 10 times faster on computation in the Web serving application than 
the original processor. Assuming that the original processor is busy with compu- 
tation 40% of the time and is waiting for I/O 60% of the time, what is the overall 
speedup gained by incorporating the enhancement? 


$$
\text{Fraction}_{\text{enhanced}} = 0.4, \quad \text{Speedup}_{\text{enhanced}} = 10, \quad \text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56
$$


Amdahl's Law expresses the law of diminishing returns: The incremental 
improvement in speedup gained by an improvement of just a portion of the com- 
putation diminishes as improvements are added. An important corollary of 
Amdahl's Law is that if an enhancement is only usable for a fraction of a task, we 
can't speed up the task by more than the reciprocal of 1 minus that fraction. 
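Amdahl's Law and its corollary can be sketched as a small function. This is an illustrative sketch only; the function names are hypothetical, and the inputs reuse the 20/60 fraction and 5/2 improvement from the two factors described earlier.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the ORIGINAL execution time
    runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time

def amdahl_limit(fraction_enhanced):
    """Upper bound on overall speedup: the reciprocal of 1 minus the fraction."""
    return 1.0 / (1.0 - fraction_enhanced)

fraction = 20 / 60            # 20 of 60 seconds can use the enhancement
print(amdahl_speedup(fraction, 5 / 2))   # enhanced portion runs 2.5 times faster
print(amdahl_limit(fraction))            # bound as the enhancement speedup grows

# Diminishing returns: each tenfold increase in the enhancement buys less.
for s in (10, 100, 1000):
    print(s, round(amdahl_speedup(fraction, s), 4))
```

The loop makes the corollary concrete: no matter how large the enhancement's own speedup becomes, the overall result approaches but never exceeds `amdahl_limit(fraction)`.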

A common mistake in applying Amdahl's Law is to confuse "fraction of time 
converted to use an enhancement" and "fraction of time after enhancement is in 
use." If, instead of measuring the time that we could use the enhancement in a 
computation, we measure the time after the enhancement is in use, the results 
will be incorrect! 
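The size of this error is easy to see numerically. The sketch below reuses the Web-serving numbers (40% of the original time, enhanced by a factor of 10); the function names are hypothetical, and the conversion formula is derived from the execution-time equation above rather than quoted from the text.

```python
def amdahl_speedup(fraction_of_original, speedup_enhanced):
    """Amdahl's Law; the fraction must be measured on the ORIGINAL run."""
    return 1.0 / ((1.0 - fraction_of_original)
                  + fraction_of_original / speedup_enhanced)

def fraction_after_enhancement(fraction_of_original, speedup_enhanced):
    """Share of the NEW execution time spent in the enhanced mode."""
    enhanced_time = fraction_of_original / speedup_enhanced
    return enhanced_time / ((1.0 - fraction_of_original) + enhanced_time)

def original_fraction_from_after(fraction_after, speedup_enhanced):
    """Convert a fraction measured after the enhancement back to the
    fraction of original time that Amdahl's Law expects."""
    scaled = fraction_after * speedup_enhanced
    return scaled / ((1.0 - fraction_after) + scaled)

f_orig, s = 0.4, 10.0
f_after = fraction_after_enhancement(f_orig, s)           # 0.0625, not 0.4
correct = amdahl_speedup(f_orig, s)                        # about 1.56
wrong = amdahl_speedup(f_after, s)                         # about 1.06, misleading
recovered = original_fraction_from_after(f_after, s)       # back to 0.4
print(f_after, correct, wrong, recovered)
```

Plugging the post-enhancement fraction straight into the formula understates the speedup badly, which is why the measurement must come from the original run or be converted first.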

Amdahl's Law can serve as a guide to how much an enhancement will 
improve performance and how to distribute resources to improve cost- 
performance. The goal, clearly, is to spend resources proportional to where time 
is spent. Amdahl's Law is particularly useful for comparing the overall system 
performance of two alternatives, but it can also be applied to compare two pro- 
cessor design alternatives, as the following example shows. 


A common transformation required in graphics processors is square root. Imple- 
mentations of floating-point (FP) square root vary significantly in performance, 
especially among processors designed for graphics. Suppose FP square root 
(FPSQR) is responsible for 20% of the execution time of a critical graphics 
benchmark. One proposal is to enhance the FPSQR hardware and speed up this 
operation by a factor of 10. The other alternative is just to try to make all FP 
instructions in the graphics processor run faster by a factor of 1.6; FP instructions 
are responsible for half of the execution time for the application. The design team 
believes that they can make all FP instructions run 1.6 times faster with the same 
effort as required for the fast square root. Compare these two design alternatives. 


We can compare these two alternatives by comparing the speedups: 





$$
\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} = 1.22
$$

$$
\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} = 1.23
$$

Improving the performance of the FP operations overall is slightly better because
of the higher frequency.


Amdahl's Law is applicable beyond performance. Let's redo the reliability 
example from page 27 after improving the reliability of the power supply via 
redundancy from 200,000-hour to 830,000,000-hour MTTF, or 4150X better. 


The calculation of the failure rates of the disk subsystem was 


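$$
\text{Failure rate}_{\text{system}} = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} + \frac{1}{200{,}000} + \frac{1}{1{,}000{,}000}
= \frac{10 + 2 + 5 + 5 + 1}{1{,}000{,}000 \text{ hours}} = \frac{23}{1{,}000{,}000 \text{ hours}}
$$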











Therefore, the fraction of the failure rate that could be improved is 5 per million 
hours out of 23 for the whole system, or 0.22. 


The reliability improvement would be 


$$
\text{Improvement}_{\text{power supply pair}} = \frac{1}{(1 - 0.22) + \dfrac{0.22}{4150}} = \frac{1}{0.78} = 1.28
$$





Despite an impressive 4150X improvement in reliability of one module, from the 
system's perspective, the change has a measurable but small benefit. 
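The reliability calculation can be reproduced with a short script. This is a sketch under stated assumptions: the counts and MTTF values come from the failure-rate sum above, while the component names other than the power supply are assumed labels for illustration.

```python
# (count, MTTF in hours) per component type; names are illustrative labels.
components = {
    "disks": (10, 1_000_000),
    "controller": (1, 500_000),
    "power supply": (1, 200_000),
    "fan": (1, 200_000),
    "cable": (1, 1_000_000),
}

def failure_rate(parts):
    """Sum of per-component failure rates (failures per hour)."""
    return sum(count / mttf for count, mttf in parts.values())

system_rate = failure_rate(components)                 # 23 per million hours
ps_count, ps_mttf = components["power supply"]
fraction_improvable = (ps_count / ps_mttf) / system_rate   # 5 of 23, about 0.22

# Redundancy raises the power supply MTTF from 200,000 to 830,000,000 hours.
reliability_gain = 830_000_000 / ps_mttf                # 4150

# Amdahl's Law applied to failure rate instead of execution time.
improvement = 1.0 / ((1.0 - fraction_improvable)
                     + fraction_improvable / reliability_gain)
print(round(system_rate * 1e6, 2), round(fraction_improvable, 2),
      round(improvement, 2))
```

Changing `reliability_gain` to an arbitrarily large number shows the ceiling: the improvement cannot exceed 1 / (1 - 5/23), about 1.28, because the other components still fail at the same rate.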


In the examples above we needed the fraction consumed by the new and 
improved version; often it is difficult to measure these times directly. In the next 
section, we will see another way of doing such comparisons based on the use of 
an equation that decomposes the CPU execution time into three separate compo- 
nents. If we know how an alternative affects these three components, we can 
determine its overall performance. Furthermore, it is often possible to build simu- 
lators that measure these components before the hardware is actually designed. 


The Processor Performance Equation 


Essentially all computers are constructed using a clock running at a constant rate. 
These discrete time events are called ticks, clock ticks, clock periods, clocks, 
cycles, or clock cycles. Computer designers refer to the time of a clock period by 
its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can 
then be expressed two ways: 


$$
\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}
$$

or

$$
\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}
$$

In addition to the number of clock cycles needed to execute a program, we
can also count the number of instructions executed—the instruction path length 
or instruction count (IC). If we know the number of clock cycles and the instruc- 
tion count, we can calculate the average number of clock cycles per instruction 
(CPI). Because it is easier to work with, and because we will deal with simple 
processors in this chapter, we use CPI. Designers sometimes also use instructions 
per clock (IPC), which is the inverse of CPI. 

CPI is computed as 


$$
\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}
$$

This processor figure of merit provides insight into different styles of instruction
sets and implementations, and we will use it extensively in the next four chapters. 
By transposing instruction count in the above formula, clock cycles can be 
defined as IC x CPI. This allows us to use CPI in the execution time formula: 


$$
\text{CPU time} = \text{Instruction count} \times \text{Cycles per instruction} \times \text{Clock cycle time}
$$


Expanding the first formula into the units of measurement shows how the pieces 
fit together: 


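$$
\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}
$$

As this formula demonstrates, processor performance is dependent upon three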
characteristics: clock cycle (or rate), clock cycles per instruction, and instruction 
count. Furthermore, CPU time is equally dependent on these three characteris- 
tics: A 10% improvement in any one of them leads to a 10% improvement in 
CPU time. 
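A small sketch makes the equal dependence concrete. The workload numbers below are hypothetical, and "10% improvement" is interpreted here as a 10% reduction in the factor being changed.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time

# Hypothetical program: 2 billion instructions, CPI of 1.5, 1 ns cycle (1 GHz).
ic, cpi, cycle = 2e9, 1.5, 1e-9
base = cpu_time(ic, cpi, cycle)                  # 3.0 seconds

# Reduce each factor by 10% in turn; CPU time drops by 10% in every case.
fewer_instructions = cpu_time(ic * 0.9, cpi, cycle)
lower_cpi = cpu_time(ic, cpi * 0.9, cycle)
shorter_cycle = cpu_time(ic, cpi, cycle * 0.9)
print(base, fewer_instructions, lower_cpi, shorter_cycle)
```

Because the three factors multiply, the result is the same whichever one is reduced; the practical difficulty, as the following paragraph notes, is changing one without disturbing the others.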
Unfortunately, it is difficult to change one parameter in complete isolation 
from others because the basic technologies involved in changing each character- 
istic are interdependent: 


■ Clock cycle time: Hardware technology and organization
■ CPI: Organization and instruction set architecture
■ Instruction count: Instruction set architecture and compiler technology


Luckily, many potential performance improvement techniques primarily improve 
one component of processor performance with small or predictable impacts on 
the other two. 

Sometimes it is useful in designing the processor to calculate the number of 
total processor clock cycles as 


$$
\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i
$$

where IC_i represents the number of times instruction i is executed in a program and
CPI_i represents the average number of clocks per instruction for instruction i.
This form can be used to express CPU time as

$$
\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}
$$


and overall CPI as 


$$
\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i
$$

The latter form of the CPI calculation uses each individual CPI_i and the fraction
of occurrences of that instruction in a program (i.e., IC_i ÷ Instruction count). CPI_i
should be measured and not just calculated from a table in the back of a reference
manual since it must include pipeline effects, cache misses, and any other memory
system inefficiencies.
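Both forms of the CPI calculation can be checked against each other in code. The instruction classes, counts, and per-class CPI values below are hypothetical placeholders for measured data.

```python
def overall_cpi_from_cycles(mix):
    """CPI as total clock cycles divided by total instruction count.
    `mix` maps a class name to (instruction count, CPI for that class)."""
    total_ic = sum(ic for ic, _ in mix.values())
    total_cycles = sum(ic * cpi for ic, cpi in mix.values())
    return total_cycles / total_ic

def overall_cpi_from_fractions(mix):
    """CPI as the frequency-weighted sum of per-class CPI values."""
    total_ic = sum(ic for ic, _ in mix.values())
    return sum((ic / total_ic) * cpi for ic, cpi in mix.values())

# Hypothetical measured mix: (instruction count, average CPI).
mix = {
    "alu": (50, 1.0),
    "load_store": (30, 2.0),
    "branch": (20, 3.0),
}
print(overall_cpi_from_cycles(mix), overall_cpi_from_fractions(mix))
```

The two functions agree up to floating-point rounding; the fraction form is the more convenient one when only relative instruction frequencies are available.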
Consider our performance example on page 40, here modified to use mea- 
surements of the frequency of the instructions and of the instruction CPI values, 
which, in practice, are obtained by simulation or by hardware instrumentation. 


Suppose we have made the following measurements: 


Frequency of FP operations = 25% 
Average CPI of FP operations = 4.0 
Average CPI of other instructions = 1.33 
Frequency of FPSQR = 2%
CPI of FPSQR = 20


Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or 
to decrease the average CPI of all FP operations to 2.5. Compare these two 
design alternatives using the processor performance equation. 


First, observe that only the CPI changes; the clock rate and instruction count 
remain identical. We start by finding the original CPI with neither enhancement: 


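$$
\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \left( \frac{\text{IC}_i}{\text{Instruction count}} \right) = (4 \times 25\%) + (1.33 \times 75\%) = 2.0
$$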


We can compute the CPI for the enhanced FPSQR by subtracting the cycles 
saved from the original CPI: 


$$
\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{of new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64
$$


We can compute the CPI for the enhancement of all FP instructions the same way 
or by summing the FP and non-FP CPIs. Using the latter gives us 


$$
\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.62
$$


Since the CPI of the overall FP enhancement is slightly lower, its performance 
will be marginally better. Specifically, the speedup for the overall FP enhance- 
ment is 

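$$
\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.00}{1.625} = 1.23
$$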


Happily, we obtained this same speedup using Amdahl's Law on page 40. 
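The comparison can be replayed numerically from the stated measurements. This is a sketch using the rounded CPI of 1.33 for non-FP instructions, so intermediate values differ slightly from the rounded figures in the text; the variable names are illustrative.

```python
# Measurements given in the example.
freq_fp, cpi_fp = 0.25, 4.0          # all FP operations
cpi_other = 1.33                     # all other instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR alone

# Original CPI: frequency-weighted sum over FP and non-FP instructions.
cpi_original = freq_fp * cpi_fp + (1.0 - freq_fp) * cpi_other

# Alternative 1: cut FPSQR CPI from 20 to 2 (subtract the cycles saved).
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: cut the average CPI of all FP operations to 2.5.
cpi_new_fp = (1.0 - freq_fp) * cpi_other + freq_fp * 2.5

# Instruction count and clock cycle are unchanged, so speedup is a CPI ratio.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp
print(round(cpi_original, 2), round(cpi_new_fpsqr, 2), round(cpi_new_fp, 2))
print(round(speedup_fpsqr, 2), round(speedup_fp, 2))
```

The script reproduces the ordering in the text: the FP-wide enhancement comes out slightly ahead of the FPSQR-only enhancement.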


It is often possible to measure the constituent parts of the processor perfor- 
mance equation. This is a key advantage of using the processor performance 
equation versus Amdahl's Law in the previous example. In particular, it may be 
difficult to measure things such as the fraction of execution time for which a set 
of instructions is responsible. In practice, this would probably be computed by 
summing the product of the instruction count and the CPI for each of the instruc- 
tions in the set. Since the starting point is often individual instruction count and 
CPI measurements, the processor performance equation is incredibly useful. 

To use the processor performance equation as a design tool, we need to be 
able to measure the various factors. For an existing processor, it is easy to obtain 
the execution time by measurement, and the clock speed is known. The challenge 
lies in discovering the instruction count or the CPI. Most new processors include 
counters for both instructions executed and for clock cycles. By periodically 
monitoring these counters, it is also possible to attach execution time and instruc- 
tion count to segments of the code, which can be helpful to programmers trying 
to understand and tune the performance of an application. Often, a designer or 
programmer will want to understand performance at a more fine-grained level 
than what is available from the hardware counters. For example, they may want 
to know why the CPI is what it is. In such cases, simulation techniques like those 
used for processors that are being designed are used. 


Putting It All Together: Performance and 
Price-Performance 


In the "Putting It All Together" sections that appear near the end of every chapter, 
we show real examples that use the principles in that chapter. In this section, we
look at measures of performance and price-performance, first in desktop systems
using the SPEC benchmark and then in servers using the TPC-C benchmark. 


Performance and Price-Performance for Desktop and 
Rack-Mountable Systems 


Although there are many benchmark suites for desktop systems, a majority of 
them are OS or architecture specific. In this section we examine the processor 
performance and price-performance of a variety of desktop systems using the 
SPEC CPU2000 integer and floating-point suites. As mentioned in Figure 1.14, 
SPEC CPU2000 summarizes processor performance using a geometric mean 
normalized to a Sun Ultra 5, with larger numbers indicating higher performance. 

Figure 1.15 shows the five systems including the processors and price. Each 
system was configured with one processor, 1 GB of DDR DRAM (with ECC if 
available), approximately 80 GB of disk, and an Ethernet connection. The desk- 
top systems come with a fast graphics card and a monitor, while the rack-mount- 
able systems do not. The wide variation in price is driven by a number of factors, 
including the cost of the processor, software differences (Linux or a Microsoft 
OS versus a vendor-specific OS), system expandability, and the commoditization 
effect, which we discussed in Section 1.6. 

Figure 1.16 shows the performance and the price-performance of these five 
systems using SPEC CINT2000base and CFP2000base as the metrics. The figure 
also plots price-performance on the right axis, showing CINT or CFP per $1000 
of price. Note that in every case, floating-point performance exceeds integer per- 
formance relative to the base computer. 




















Vendor/model                     Processor              Clock rate   L2 cache   Type   Price
Dell Precision Workstation 380   Intel Pentium 4 Xeon   3.8 GHz      2 MB       Desk   $3346
HP ProLiant BL25p                AMD Opteron 252        2.6 GHz      1 MB       Rack   $3099
HP ProLiant ML350 G4             Intel Pentium 4 Xeon   3.4 GHz      1 MB       Desk   $2907
HP Integrity rx2620-2            Itanium 2              1.6 GHz      3 MB       Rack   $5201
Sun Java Workstation W1100z      AMD Opteron 150        2.4 GHz      1 MB       Desk   $2145





Figure 1.15 Five different desktop and rack-mountable systems from three vendors using three different 
microprocessors showing the processor, its clock rate, L2 cache size, and the selling price. Figure 1.16 plots 
absolute performance and price performance. All these systems are configured with 1 GB of ECC SDRAM and 
approximately 80 GB of disk. (If software costs were not included, we added them.) Many factors are responsible 
for the wide variation in price despite these common elements. First, the systems offer different levels of expand- 
ability (with the Sun Java Workstation being the least expandable, the Dell systems being moderately expandable, 
and the HP BL25p blade server being the most expandable). Second, the cost of the processor varies by at least a 
factor of 2, with much of the reason for the higher costs being the size of the L2 cache and the larger die. In 2005,
the Opteron sold for about $500 to $800 and Pentium 4 Xeon sold for about $400 to $700, depending on clock
rates and cache size. The Itanium 2 die size is much larger than the others, so it's probably at least twice the cost.
Third, software differences (Linux or a Microsoft OS versus a vendor-specific OS) probably affect the final price. 
These prices were as of August 2005.


Figure 1.16 Performance and price-performance for five systems in Figure 1.15
measured using SPEC CINT2000 and CFP2000 as the benchmark. Price-performance 
is plotted as CINT2000 and CFP2000 performance per $1000 in system cost. These per- 
formance numbers were collected in January 2006 and prices were as of August 2005. 
The measurements are available online at www.spec.org. 


The Itanium 2-based design has the highest floating-point performance but 
also the highest cost, and hence has the lowest performance per thousand dollars, 
being off a factor of 1.1-1.6 in floating-point and 1.8-2.5 in integer performance. 
While the Dell based on the 3.8 GHz Intel Xeon with a 2 MB L2 cache has the
highest performance for CINT and second highest for CFP, it also has a much higher
cost than the Sun product based on the 2.4 GHz AMD Opteron with a 1 MB L2 
cache, making the latter the price-performance leader for CINT and CFP. 


Performance and Price-Performance for 
Transaction-Processing Servers 


One of the largest server markets is online transaction processing (OLTP). The 
standard industry benchmark for OLTP is TPC-C, which relies on a database sys- 
tem to perform queries and updates. Five factors make the performance of TPC-C 
particularly interesting. First, TPC-C is a reasonable approximation to a real
OLTP application. Although this makes the benchmark complex and time-consuming
to run, it makes the results reasonably indicative of real performance for OLTP.
Second, TPC-C measures total system performance, including the hardware, the
operating system, the I/O system, and the database system, making the benchmark more predictive of
real performance. Third, the rules for running the benchmark and reporting exe- 
cution time are very complete, resulting in numbers that are more comparable. 
Fourth, because of the importance of the benchmark, computer system vendors 
devote significant effort to making TPC-C run well. Fifth, vendors are required to
report both performance and price-performance, enabling us to examine both.
For TPC-C, performance is measured in transactions per minute (TPM), while 
price-performance is measured in dollars per TPM. 
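The two ways of expressing price-performance are reciprocals up to a factor of 1000, as a short sketch shows. The system price and throughput below are hypothetical, chosen so that the result matches the 715 TPM per $1000 conversion quoted in the Figure 1.18 caption.

```python
def dollars_per_tpm(price_dollars, tpm):
    """Conventional TPC-C price-performance: lower is better."""
    return price_dollars / tpm

def tpm_per_thousand_dollars(price_dollars, tpm):
    """Inverse form (TPM per $1000 of system cost): higher is better."""
    return tpm / (price_dollars / 1000.0)

# Hypothetical system: $40,000 total price, 28,600 transactions per minute.
price, tpm = 40_000, 28_600
print(tpm_per_thousand_dollars(price, tpm))      # TPM per $1000
print(round(dollars_per_tpm(price, tpm), 2))     # dollars per TPM
```

Plotting TPM per $1000 keeps "higher is better" consistent with the performance axis, while dollars per TPM remains the figure that vendors report.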

Figure 1.17 shows the characteristics of 10 systems whose performance or 
price-performance is near the top in one measure or the other. Figure 1.18 plots 
absolute performance on a log scale and price-performance on a linear scale. The 
number of disks is determined by the number of I/Os per second to match the 
performance target rather than the storage capacity needed to run the benchmark.

The highest-performing system is a 64-node shared-memory multiprocessor 
from IBM, costing a whopping $17 million. It is about twice as expensive and 
twice as fast as the same model half its size, and almost three times faster than the 
third-place cluster from HP. These five computers average 35-50 disks per pro- 
cessor and 16-20 GB of DRAM per processor. Chapter 4 discusses the design of 
multiprocessor systems, and Chapter 6 and Appendix E describe clusters. 

The computers with the best price-performance are all uniprocessors based 
on Pentium 4 Xeon processors, although the L2 cache size varies. Notice that 
these systems have about three to four times better price-performance than the 



































Vendor and system             Processors              Memory       Storage       Database/OS              Price

IBM eServer p5 595            64 IBM POWER 5          64 cards,    6548 disks,   IBM DB2 UDB 8.2/         $16,669,230
                              @ 1.9 GHz, 36 MB L3     2048 GB      243,236 GB    IBM AIX 5L V5.3

IBM eServer p5 595            32 IBM POWER 5          32 cards,    3298 disks,   Oracle 10g EE/           $8,428,470
                              @ 1.9 GHz, 36 MB L3     1024 GB      112,885 GB    IBM AIX 5L V5.3

HP Integrity rx5670 Cluster   64 Intel Itanium 2      768 dimms,   2195 disks,   Oracle 10g EE/           $6,541,770
                              @ 1.5 GHz, 6 MB L3      768 GB       93,184 GB     Red Hat E Linux AS 3

HP Integrity Superdome        64 Intel Itanium 2      512 dimms,   1740 disks,   MS SQL Server 2005 EE/   $5,820,285
                              @ 1.6 GHz, 9 MB L3      1024 GB      53,743 GB     MS Windows DE 64b

IBM eServer pSeries 690       32 IBM POWER4+          4 cards,     1995 disks,   IBM DB2 UDB 8.1/         $5,571,349
                              @ 1.9 GHz, 128 MB L3    1024 GB      74,098 GB     IBM AIX 5L V5.2

Dell PowerEdge 2800           1 Intel Xeon            2 dimms,     76 disks,     MS SQL Server 2000 WE/   $39,340
                              @ 3.4 GHz, 2 MB L2      2.5 GB       2585 GB       MS Windows 2003

Dell PowerEdge 2850           1 Intel Xeon            2 dimms,     76 disks,     MS SQL Server 2000 SE/   $40,170
                              @ 3.4 GHz, 1 MB L2      2.5 GB       1400 GB       MS Windows 2003

HP ProLiant ML350             1 Intel Xeon            3 dimms,     34 disks,     MS SQL Server 2000 SE/   $27,827
                              @ 3.1 GHz, 0.5 MB L2    2.5 GB       696 GB        MS Windows 2003 SE

HP ProLiant ML350             1 Intel Xeon            4 dimms,     35 disks,     IBM DB2 UDB EE V8.1/     $29,990
                              @ 3.1 GHz, 0.5 MB L2    4 GB         692 GB        SUSE Linux ES 9

HP ProLiant ML350             1 Intel Xeon            4 dimms,     35 disks,     IBM DB2 UDB EE V8.1/     $30,600
                              @ 2.8 GHz, 0.5 MB L2    3.25 GB      692 GB        MS Windows 2003 SE





Figure 1.17 The characteristics of 10 OLTP systems, using TPC-C as the benchmark, with either high total
performance (top half of the table, measured in transactions per minute) or superior price-performance (bottom half
of the table, measured in U.S. dollars per transactions per minute). Figure 1.18 plots absolute performance and
price-performance, and Figure 1.19 splits the price between processors, memory, storage, and software.


Figure 1.18 Performance and price-performance for the 10 systems in Figure 1.17
using TPC-C as the benchmark. Price-performance is plotted as TPM per $1000 in
system cost, although the conventional TPC-C measure is $/TPM (715 TPM/$1000 = $1.40
per TPM). These performance numbers and prices were as of July 2005. The
measurements are available online at www.tpc.org.


high-performance systems. Although these five computers also average 35-50 
disks per processor, they only use 2.5-3 GB of DRAM per processor. It is hard to 
tell whether this is the best choice or whether it simply reflects the 32-bit address 
space of these less expensive PC servers. Since doubling memory would only add 
about 4% to their price, it is likely the latter reason. 


Fallacies and Pitfalls 


The purpose of this section, which will be found in every chapter, is to explain 
some commonly held misbeliefs or misconceptions that you should avoid. We 
call such misbeliefs fallacies. When discussing a fallacy, we try to give a counter- 
example. We also discuss pitfalls—easily made mistakes. Often pitfalls are gen- 
eralizations of principles that are true in a limited context. The purpose of these 
sections is to help you avoid making these errors in computers that you design. 


Pitfall Falling prey to Amdahl's Law.


Virtually every practicing computer architect knows Amdahl's Law. Despite this, 
we almost all occasionally expend tremendous effort optimizing some feature 
before we measure its usage. Only when the overall speedup is disappointing do 
we recall that we should have measured first before we spent so much effort 
enhancing it!


Pitfall A single point of failure.


The calculations of reliability improvement using Amdahl's Law on page 41 
show that dependability is no stronger than the weakest link in a chain. No matter 
how much more dependable we make the power supplies, as we did in our exam- 
ple, the single fan will limit the reliability of the disk subsystem. This Amdahl's 
Law observation led to a rule of thumb for fault-tolerant systems to make sure 
that every component was redundant so that no single component failure could 
bring down the whole system. 


Fallacy The cost of the processor dominates the cost of the system. 


Computer science is processor centric, perhaps because processors seem more 
intellectually interesting than memories or disks and perhaps because algorithms 
are traditionally measured in number of processor operations. This fascination 
leads us to think that processor utilization is the most important figure of merit. 
Indeed, the high-performance computing community often evaluates algorithms 
and architectures by what fraction of peak processor performance is achieved. 
This would make sense if most of the cost were in the processors. 

Figure 1.19 shows the breakdown of costs for the computers in Figure 1.17 
into the processor (including the cabinets, power supplies, and so on), DRAM 





                                         Processor +
                                         cabinetry     Memory   Storage   Software
IBM eServer p5 595                          28%          16%      51%        6%
IBM eServer p5 595                          13%          31%      52%        4%
HP Integrity rx5670 Cluster                 11%          22%      35%       33%
HP Integrity Superdome                      33%          32%      15%       20%
IBM eServer pSeries 690                     21%          24%      48%        7%
Median of high-performance computers        21%          24%      48%        7%
Dell PowerEdge 2800                          6%           3%      80%       11%
Dell PowerEdge 2850                          7%           3%      76%       14%
HP ProLiant ML350                            5%           4%      70%       21%
HP ProLiant ML350                            9%           8%      65%       19%
HP ProLiant ML350                            8%           6%      65%       21%
Median of price-performance computers        7%           4%      70%       19%





Figure 1.19 Cost of purchase split between processor, memory, storage, and software
for the top computers running the TPC-C benchmark in Figure 1.17. Memory is
just the cost of the DRAM modules, so all the power and cooling for the computer is
credited to the processor. TPC-C includes the cost of the clients to drive the TPC-C
benchmark and the three-year cost of maintenance, which are not included here.
Maintenance would add about 10% to the numbers here, with differences in software
maintenance costs making the range be 5% to 22%. Including client hardware would add
about 2% to the price of the high-performance servers and 7% to the PC servers.


memory, disk storage, and software. Even giving the processor category the
credit for the sheet metal, power supplies, and cooling, it's only about 20% of the 
costs for the large-scale servers and less than 10% of the costs for the PC servers. 
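The medians behind these two summary figures can be recomputed from the per-system percentages in Figure 1.19. The sketch below is illustrative; the tuple layout and function name are assumptions, and the numbers are transcribed from the figure.

```python
from statistics import median

# Columns: processor + cabinetry, memory, storage, software (percent of price).
high_performance = [
    (28, 16, 51, 6),    # IBM eServer p5 595 (64 processors)
    (13, 31, 52, 4),    # IBM eServer p5 595 (32 processors)
    (11, 22, 35, 33),   # HP Integrity rx5670 Cluster
    (33, 32, 15, 20),   # HP Integrity Superdome
    (21, 24, 48, 7),    # IBM eServer pSeries 690
]
price_performance = [
    (6, 3, 80, 11),     # Dell PowerEdge 2800
    (7, 3, 76, 14),     # Dell PowerEdge 2850
    (5, 4, 70, 21),     # HP ProLiant ML350
    (9, 8, 65, 19),     # HP ProLiant ML350
    (8, 6, 65, 21),     # HP ProLiant ML350
]

def column_medians(rows):
    """Median of each cost category across a group of systems."""
    return tuple(median(column) for column in zip(*rows))

print(column_medians(high_performance))
print(column_medians(price_performance))
```

The first element of each result is the processor share: roughly a fifth of the price for the large servers and under a tenth for the PC servers, with storage dominating in both groups.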


Fallacy Benchmarks remain valid indefinitely.


Several factors influence the usefulness of a benchmark as a predictor of real per- 
formance, and some change over time. A big factor influencing the usefulness of 
a benchmark is its ability to resist "cracking," also known as "benchmark engi- 
neering" or "benchmarksmanship." Once a benchmark becomes standardized and 
popular, there is tremendous pressure to improve performance by targeted opti- 
mizations or by aggressive interpretation of the rules for running the benchmark. 
Small kernels or programs that spend their time in a very small number of lines of 
code are particularly vulnerable. 

For example, despite the best intentions, the initial SPEC89 benchmark suite 
included a small kernel, called matrix300, which consisted of eight different 300 
x 300 matrix multiplications. In this kernel, 99% of the execution time was in a 
single line (see SPEC [1989]). When an IBM compiler optimized this inner loop 
(using an idea called blocking, discussed in Chapter 5), performance improved 
by a factor of 9 over a prior version of the compiler! This benchmark tested com- 
piler tuning and was not, of course, a good indication of overall performance, nor 
of the typical value of this particular optimization. 

Even after the elimination of this benchmark, vendors found methods to tune 
the performance of others by the use of different compilers or preprocessors, as 
well as benchmark-specific flags. Although the baseline performance measure- 
ments require the use of one set of flags for all benchmarks, the tuned or opti- 
mized performance does not. In fact, benchmark-specific flags are allowed, even 
if they are illegal in general and could lead to incorrect compilation! 

Over a long period, these changes may make even a well-chosen benchmark 
obsolete; Gcc is the lone survivor from SPEC89. Figure 1.13 on page 31 lists 
the status of all 70 benchmarks from the various SPEC releases. Amazingly, 
almost 70% of all programs from SPEC2000 or earlier were dropped from the 
next release. 


Fallacy The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so
disks practically never fail.


The current marketing practices of disk manufacturers can mislead users. How is 
such an MTTF calculated? Early in the process, manufacturers will put thousands 
of disks in a room, run them for a few months, and count the number that fail. 
They compute MTTF as the total number of hours that the disks worked cumula- 
tively divided by the number that failed. 

One problem is that this number far exceeds the lifetime of a disk, which is 
commonly assumed to be 5 years or 43,800 hours. For this large MTTF to make 
some sense, disk manufacturers argue that the model corresponds to a user who 
buys a disk, and then keeps replacing the disk every 5 years—the planned 
lifetime of the disk. The claim is that if many customers (and their great-
grandchildren) did this for the next century, on average they would replace a disk
27 times before a failure, or about 140 years. 

A more useful measure would be percentage of disks that fail. Assume 1000 
disks with a 1,000,000-hour MTTF and that the disks are used 24 hours a day. If 
you replaced failed disks with a new one having the same reliability characteris- 
tics, the number that would fail in a year (8760 hours) is 


$$\text{Failed disks} = \frac{\text{Number of disks} \times \text{Time period}}{\text{MTTF}} = \frac{1000 \text{ disks} \times 8760 \text{ hours/drive}}{1{,}000{,}000 \text{ hours/failure}} \approx 9$$





Stated alternatively, 0.9% would fail per year, or 4.4% over a 5-year lifetime. 
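
The arithmetic above can be checked with a short sketch. The function name `expected_failures` and its parameters are illustrative assumptions, not part of the text; the sketch assumes, as the text does, that failed disks are replaced by disks of the same MTTF and that all disks run 24 hours a day.

```python
HOURS_PER_YEAR = 8760  # 24 hours/day * 365 days, as used in the text


def expected_failures(num_disks: int, mttf_hours: float, period_hours: float) -> float:
    """Expected number of failures over a period, assuming failed disks are
    replaced by disks with the same MTTF (a constant failure rate)."""
    return num_disks * period_hours / mttf_hours


failures_per_year = expected_failures(1000, 1_000_000, HOURS_PER_YEAR)
annual_rate = failures_per_year / 1000   # fraction of the 1000 disks failing per year
five_year_rate = 5 * annual_rate         # simple linear estimate over a 5-year lifetime

print(f"{failures_per_year:.2f} failures per year")  # the text rounds this to 9
print(f"{annual_rate:.1%} per year")
print(f"{five_year_rate:.1%} over 5 years")
```

The unrounded value is 8.76 failures per year, which is why the text quotes 0.9% per year and 4.4% over five years.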

Moreover, those high numbers are quoted assuming limited ranges of temper- 
ature and vibration; if they are exceeded, then all bets are off. A recent survey of 
disk drives in real environments [Gray and van Ingen 2005] claims about 3-6% 
of SCSI drives fail per year, or an MTTF of about 150,000-300,000 hours, and 
about 3-7% of ATA drives fail per year, or an MTTF of about 125,000-300,000 
hours. The quoted MTTF of ATA disks is usually 500,000-600,000 hours. Hence, 
according to this report, real-world MTTF is about 2-4 times worse than manu-
facturer's MTTF for ATA disks and 4-8 times worse for SCSI disks.
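
The conversion between an annual failure rate and an MTTF used in the survey figures can be sketched as follows. The helper name `mttf_from_annual_failure_rate` is an assumption made for illustration, and the sketch assumes continuous operation (8760 hours per year) with a constant failure rate.

```python
HOURS_PER_YEAR = 8760


def mttf_from_annual_failure_rate(afr: float) -> float:
    """MTTF in hours implied by the fraction of disks failing per year,
    assuming 24-hour operation and a constant failure rate."""
    return HOURS_PER_YEAR / afr


for afr in (0.03, 0.06, 0.07):
    mttf = mttf_from_annual_failure_rate(afr)
    print(f"{afr:.0%} failing per year -> MTTF of about {mttf:,.0f} hours")
```

A 3% rate gives roughly 292,000 hours and a 6% or 7% rate gives roughly 146,000 or 125,000 hours, in line with the rounded ranges quoted above.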


Fallacy Peak performance tracks observed performance. 


The only universally true definition of peak performance is "the performance
level a computer is guaranteed not to exceed." Figure 1.20 shows the percentage
of peak performance for four programs on four multiprocessors. It varies from
5% to 58%. Since the gap is so large and can vary significantly by benchmark,
peak performance is not generally useful in predicting observed performance.


Pitfall Fault detection can lower availability.


This apparently ironic pitfall is because computer hardware has a fair amount of 
state that may not always be critical to proper operation. For example, it is not 
fatal if an error occurs in a branch predictor, as only performance may suffer. 

In processors that try to aggressively exploit instruction-level parallelism, not 
all the operations are needed for correct execution of the program. Mukherjee et 
al. [2003] found that less than 30% of the operations were potentially on the crit- 
ical path for the SPEC2000 benchmarks running on an Itanium 2. 

The same observation is true about programs. If a register is "dead" in a pro- 
gram—that is, the program will write it before it is read again—then errors do not 
matter. If you were to crash the program upon detection of a transient fault in a 
dead register, it would lower availability unnecessarily. 

Sun Microsystems lived this pitfall in 2000 with an L2 cache that included 
parity, but not error correction, in its Sun E3000 to Sun E10000 systems. The 
SRAMs they used to build the caches had intermittent faults, which parity 
detected. If the data in the cache was not modified, the processor simply reread 
the data from the cache. Since the designers did not protect the cache with ECC, 
the operating system had no choice but to report an error to dirty data and crash the
program. Field engineers found no problems on inspection in more than 90% of
the cases.

Figure 1.20 Percentage of peak performance for four programs on four multipro-
cessors scaled to 64 processors. The Earth Simulator and X1 are vector processors. (See
Appendix F.) Not only did they deliver a higher fraction of peak performance, they had
the highest peak performance and the lowest clock rates. Except for the Paratec pro-
gram, the Power 4 and Itanium 2 systems deliver between 5% and 10% of their peak.
From Oliker et al. [2004].

To reduce the frequency of such errors, Sun modified the Solaris operating 
system to "scrub" the cache by having a process that proactively writes dirty data 
to memory. Since the processor chips did not have enough pins to add ECC, the 
only hardware option for dirty data was to duplicate the external cache, using the 
copy without the parity error to correct the error. 

The pitfall is in detecting faults without providing a mechanism to correct
them. Sun is unlikely to ship another computer without ECC on external caches.


Concluding Remarks


This chapter has introduced a number of concepts that we will expand upon as we 
go through this book. 

In Chapters 2 and 3, we look at instruction-level parallelism (ILP), of which 
pipelining is the simplest and most common form. Exploiting ILP is one of the 
most important techniques for building high-speed uniprocessors. The presence 
of two chapters reflects the fact that there are several approaches to exploiting 
ILP and that it is an important and mature technology. Chapter 2 begins with an 
extensive discussion of basic concepts that will prepare you for the wide range of 
ideas examined in both chapters. Chapter 2 uses examples that span about 35
years, drawing from one of the first supercomputers (IBM 360/91) to the fastest
processors in the market in 2006. It emphasizes what is called the dynamic or run
time approach to exploiting ILP. Chapter 3 focuses on limits and extensions to
the ILP ideas presented in Chapter 2, including multithreading to get more from 
an out-of-order organization. Appendix A is introductory material on pipelining 
for readers without much experience and background in pipelining. (We expect it 
to be review for many readers, including those of our introductory text, Computer 
Organization and Design: The Hardware/Software Interface.) 

Chapter 4 focuses on the issue of achieving higher performance using multi- 
ple processors, or multiprocessors. Instead of using parallelism to overlap indi- 
vidual instructions, multiprocessing uses parallelism to allow multiple instruction 
streams to be executed simultaneously on different processors. Our focus is on 
the dominant form of multiprocessors, shared-memory multiprocessors, though 
we introduce other types as well and discuss the broad issues that arise in any 
multiprocessor. Here again, we explore a variety of techniques, focusing on the 
important ideas first introduced in the 1980s and 1990s. 

In Chapter 5, we turn to the all-important area of memory system design. We 
will examine a wide range of techniques that conspire to make memory look 
infinitely large while still being as fast as possible. As in Chapters 2 through 4, 
we will see that hardware-software cooperation has become a key to high- 
performance memory systems, just as it has to high-performance pipelines. This 
chapter also covers virtual machines. Appendix C is introductory material on 
caches for readers without much experience and background in them. 

In Chapter 6, we move away from a processor-centric view and discuss issues 
in storage systems. We apply a similar quantitative approach, but one based on 
observations of system behavior and using an end-to-end approach to perfor- 
mance analysis. It addresses the important issue of how to efficiently store and 
retrieve data using primarily lower-cost magnetic storage technologies. Such 
technologies offer better cost per bit by a factor of 50-100 over DRAM. In Chap- 
ter 6, our focus is on examining the performance of disk storage systems for typ- 
ical I/O-intensive workloads, like the OLTP benchmarks we saw in this chapter. 
We extensively explore advanced topics in RAID-based systems, which use 
redundant disks to achieve both high performance and high availability. Finally, 
the chapter introduces queuing theory, which gives a basis for trading off utiliza-
tion and latency.

This book comes with a plethora of material on the companion CD, both to 
lower cost and to introduce readers to a variety of advanced topics. Figure 1.21 
shows them all. Appendices A, B, and C, which appear in the book, will be 
review for many readers. Appendix D takes the embedded computing perspective 
on the ideas of each of the chapters and early appendices. Appendix E explores 
the topic of system interconnect broadly, including wide area and system area 
networks used to allow computers to communicate. It also describes clusters, 
which are growing in importance due to their suitability and efficiency for data- 
base and Web server applications. 

Appendix  Title
A         Pipelining: Basic and Intermediate Concepts
B         Instruction Set Principles and Examples
C         Review of Memory Hierarchies
D         Embedded Systems (CD)
E         Interconnection Networks (CD)
F         Vector Processors (CD)
G         Hardware and Software for VLIW and EPIC (CD)
H         Large-Scale Multiprocessors and Scientific Applications (CD)
I         Computer Arithmetic (CD)
J         Survey of Instruction Set Architectures (CD)
K         Historical Perspectives and References (CD)
L         Solutions to Case Study Exercises (Online)

Figure 1.21 List of appendices.


Appendix F explores vector processors, which have become more popular 
since the last edition due in part to the NEC Global Climate Simulator being the 
world's fastest computer for several years. Appendix G reviews VLIW hardware
and software, which, in contrast, are less popular than when EPIC appeared on the
scene just before the last edition. Appendix H describes large-scale multiproces-
sors for use in high performance computing. Appendix I is the only appendix that
remains from the first edition, and it covers computer arithmetic. Appendix J is a
survey of instruction set architectures, including the 80x86, the IBM 360, the VAX,
and many RISC architectures, including ARM, MIPS, Power, and SPARC. We 
describe Appendix K below. Appendix L has solutions to Case Study exercises. 
Historical Perspectives and References


Appendix K on the companion CD includes historical perspectives on the key 
ideas presented in each of the chapters in this text. These historical perspective 
sections allow us to trace the development of an idea through a series of 
machines or describe significant projects. If you're interested in examining the 
initial development of an idea or machine or interested in further reading, refer- 
ences are provided at the end of each history. For this chapter, see Section K.2, 
The Early Development of Computers, for a discussion on the early development 
of digital computers and performance measurement methodologies. 

As you read the historical material, you'll soon come to realize that one of the 
important benefits of the youth of computing, compared to many other engineer-
ing fields, is that many of the pioneers are still alive, and we can learn the history by
simply asking them!
Case Study 1: Chip Fabrication Cost


Concepts illustrated by this case study 


■ Fabrication Cost
■ Fabrication Yield
■ Defect Tolerance through Redundancy


There are many factors involved in the price of a computer chip. New, smaller 
technology gives a boost in performance and a drop in required chip area. In the 
smaller technology, one can either keep the small area or place more hardware on 
the chip in order to get more functionality. In this case study, we explore how dif- 
ferent design decisions involving fabrication technology, area, and redundancy 
affect the cost of chips. 
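
As a starting point for the exercises, the die yield formula from the front-matter list (Die yield = Wafer yield x (1 + Defects per unit area x Die area / α)^(-α), with α = 4) can be sketched as a small function. The function name, the default wafer yield of 1.0, and the two example dies are assumptions chosen for illustration; they are not values from Figure 1.22.

```python
def die_yield(defects_per_cm2: float, die_area_mm2: float,
              alpha: float = 4.0, wafer_yield: float = 1.0) -> float:
    """Die yield = wafer_yield * (1 + defects * area / alpha) ** (-alpha).

    The die area is given in mm^2 and converted to cm^2 so that it matches
    the units of the defect rate.
    """
    die_area_cm2 = die_area_mm2 / 100.0
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


# Hypothetical dies at the same defect rate (0.5 defects per cm^2).
small = die_yield(0.5, 100)   # 100 mm^2 die
large = die_yield(0.5, 400)   # 400 mm^2 die
print(f"small die yield: {small:.3f}")
print(f"large die yield: {large:.3f}")
```

At a fixed defect rate the larger die has the lower yield, because each die covers more area in which a defect can land.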


[10/10/Discussion] <1.5, 1.5> Figure 1.22 gives the relevant chip statistics that 
influence the cost of several current chips. In the next few exercises, you will be 
exploring the trade-offs involved between the AMD Opteron, a single-chip pro- 
cessor, and the Sun Niagara, an 8-core chip. 


a. [10] <1.5> What is the yield for the AMD Opteron? 
b. [10] <1.5> What is the yield for an 8-core Sun Niagara processor? 


c. [Discussion] <1.4, 1.6> Why does the Sun Niagara have a worse yield than
the AMD Opteron, even though they have the same defect rate? 


[20/20/20/20/20] <1.7> You are trying to figure out whether to build a new fabri- 
cation facility for your IBM Power5 chips. It costs $1 billion to build a new fabri- 
cation facility. The benefit of the new fabrication is that you predict that you will 
be able to sell 3 times as many chips at 2 times the price of the old chips. The new 
chip will have an area of 186 mm², with a defect rate of 0.7 defects per cm².
Assume the wafer has a diameter of 300 mm. Assume it costs $500 to fabricate a 
wafer in either technology. You were previously selling the chips for 40% more 
than their cost. 














Chip          Die size (mm²)  Estimated defect rate (per cm²)  Manufacturing size (nm)  Transistors (millions)
IBM Power5    389             0.30                             130                      276
Sun Niagara   380             0.75                             90                       279
AMD Opteron   199             0.75                             90                       233


Figure 1.22 Manufacturing cost factors for several modern processors. α = 4.
a. [20] <1.5> What is the cost of the old Power5 chip?
b. [20] <1.5> What is the cost of the new Power5 chip?
c. [20] <1.5> What was the profit on each old Power5 chip?
d. [20] <1.5> What is the profit on each new Power5 chip?
e. [20] <1.5> If you sold 500,000 old Power5 chips per month, how long will it
take to recoup the costs of the new fabrication facility?


[20/20/10/10/20] <1.7> Your colleague at Sun suggests that, since the yield is so 
poor, it might make sense to sell two sets of chips, one with 8 working processors 
and one with 6 working processors. We will solve this exercise by viewing the 
yield as a probability of no defects occurring in a certain area given the defect 
rate. For the Niagara, calculate probabilities based on each Niagara core sepa- 
rately (this may not be entirely accurate, since the yield equation is based on 
empirical evidence rather than a mathematical calculation relating the probabili- 
ties of finding errors in different portions of the chip). 


a. [20] <1.7> Using the yield equation for the defect rate above, what is the
probability that a defect will occur on a single Niagara core (assuming the
chip is divided evenly between the cores) in an 8-core chip?


b. [20] <1.7> What is the probability that a defect will occur on one or two cores
(but not more than that)?


c. [10] <1.7> What is the probability that a defect will occur on none of the
cores?


d. [10] <1.7> Given your answers to parts (b) and (c), what is the number of 6-core
chips you will sell for every 8-core chip?


e. [20] <1.7> If you sell your 8-core chips for $150 each, the 6-core chips for
$100 each, the cost per die sold is $80, your research and development budget
was $200 million, and testing itself costs $1.50 per chip, how many processors
would you need to sell in order to recoup costs?
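
The probability reasoning this exercise asks for can be sketched with a binomial model. This is only an illustration under stated assumptions: defects on different cores are treated as independent (which the exercise itself notes may not be entirely accurate), and the per-core defect probability `p` and the core count `n` below are hypothetical values, not the Niagara figures.

```python
from math import comb


def prob_k_defective(n_cores: int, k: int, p_defect: float) -> float:
    """Probability that exactly k of n_cores cores contain a defect,
    assuming each core is defective independently with probability p_defect."""
    return comb(n_cores, k) * p_defect ** k * (1.0 - p_defect) ** (n_cores - k)


p = 0.1   # hypothetical per-core defect probability
n = 4     # hypothetical core count

p_none = prob_k_defective(n, 0, p)
p_one_or_two = prob_k_defective(n, 1, p) + prob_k_defective(n, 2, p)
print(f"no defective cores:         {p_none:.4f}")
print(f"one or two defective cores: {p_one_or_two:.4f}")
```

The ratio of the two probabilities gives the expected number of partially working chips per fully working chip under this model.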


Case Study 2: Power Consumption in Computer Systems 


Concepts illustrated by this case study 


■ Amdahl's Law
■ Redundancy
■ MTTF
■ Power Consumption


Power consumption in modern systems is dependent on a variety of factors, 
including the chip clock frequency, efficiency, the disk drive speed, disk drive uti- 
lization, and DRAM. The following exercises explore the impact on power that 
different design decisions and/or use scenarios have. 
Component type  Product                 Performance  Power
Processor       Sun Niagara 8-core      1.2 GHz      72-79 W peak
                Intel Pentium 4         2 GHz        48.9-66 W
DRAM            Kingston X64C3AD2 1 GB  184-pin      3.7 W
                Kingston D2N3 1 GB      240-pin      2.3 W
Hard drive      DiamondMax 16           5400 rpm     7.0 W read/seek, 2.9 W idle
                DiamondMax Plus 9       7200 rpm     7.9 W read/seek, 4.0 W idle


Figure 1.23 Power consumption of several computer components.


[20/10/20] <1.6> Figure 1.23 presents the power consumption of several com- 
puter system components. In this exercise, we will explore how the hard drive 
affects power consumption for the system. 


a. [20] <1.6> Assuming the maximum load for each component, and a power
supply efficiency of 70%, what wattage must the server's power supply
deliver to a system with a Sun Niagara 8-core chip, 2 GB 184-pin Kingston
DRAM, and two 7200 rpm hard drives?


b. [10] <1.6> How much power will the 7200 rpm disk drive consume if it is
idle roughly 40% of the time?


c. [20] <1.6> Assume that rpm is the only factor in how long a disk is not idle
(which is an oversimplification of disk performance). In other words, assume
that for the same set of requests, a 5400 rpm disk will require twice as much
time to read data as a 10,800 rpm disk. What percentage of the time would the
5400 rpm disk drive be idle to perform the same transactions as in part (b)?


[10/10/20] <1.6, 1.7> One critical factor in powering a server farm is cooling. If 
heat is not removed from the computer efficiently, the fans will blow hot air back 
onto the computer, not cold air. We will look at how different design decisions 
affect the necessary cooling, and thus the price, of a system. Use Figure 1.23 for 
your power calculations. 


a. [10] <1.6> A cooling door for a rack costs $4000 and dissipates 14 kW (into
the room; additional cost is required to get it out of the room). How many
servers with a Sun Niagara 8-core processor, 1 GB 240-pin DRAM, and a
single 5400 rpm hard drive can you cool with one cooling door?


b. [10] <1.6, 1.8> You are considering providing fault tolerance for your hard
drive. RAID 1 doubles the number of disks (see Chapter 6). Now how many
systems can you place on a single rack with a single cooler?


c. [20] <1.8> In a single rack, the MTTF of each processor is 4500 hours, of the
hard drive is 9 million hours, and of the power supply is 30K hours. For a
rack with 8 processors, what is the MTTF for the rack?
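
The MTTF of a system built from several components can be sketched as below. The sketch assumes independent components with constant failure rates, where the failure of any one component fails the whole system, so the component failure rates simply add. The function name and the example components are hypothetical, not the values from the exercise.

```python
def series_mttf(component_mttfs_hours):
    """MTTF of a system that fails when any one component fails.

    Assumes independent components with constant failure rates, so the
    system failure rate is the sum of the component failure rates.
    """
    system_failure_rate = sum(1.0 / mttf for mttf in component_mttfs_hours)
    return 1.0 / system_failure_rate


# Hypothetical system: four fans rated at 100,000 hours each and one
# power supply rated at 200,000 hours.
mttf = series_mttf([100_000] * 4 + [200_000])
print(f"system MTTF: {mttf:,.0f} hours")
```

Note that the system MTTF is always lower than the MTTF of its least reliable component, and adding more components lowers it further.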
                     Sun Fire T2000  IBM x346
Power (watts)        298             438
SPECjbb (op/s)       63,378          39,985
Power (watts)        330             438
SPECWeb (composite)  14,001          4,348


Figure 1.24 Sun power / performance comparison as selectively reported by Sun.


[10/10/Discussion] <1.2, 1.9> Figure 1.24 gives a comparison of power and per-
formance for several benchmarks comparing two servers: Sun Fire T2000 (which
uses Niagara) and IBM x346 (using Intel Xeon processors).


a. [10] <1.9> Calculate the performance/power ratio for each processor on each
benchmark.


b. [10] <1.9> If power is your main concern, which would you choose?


c. [Discussion] <1.2> For the database benchmarks, the cheaper the system, the
lower cost per database operation the system is. This is counterintuitive:
larger systems have more throughput, so one might think that buying a larger
system would be a larger absolute cost, but lower per operation cost. Since
this is true, why do any larger server farms buy expensive servers? (Hint:
Look at exercise 14 for some reasons.)


[10/20/20/20] <1.7, 1.10> Your company's internal studies show that a single- 
core system is sufficient for the demand on your processing power. You are 
exploring, however, whether you could save power by using two cores. 


a. [10] <1.10> Assume your application is 100% parallelizable. By how much
could you decrease the frequency and get the same performance?


b. [20] <1.7> Assume that the voltage may be decreased linearly with the fre-
quency. Using the equation in Section 1.5, how much dynamic power would
the dual-core system require as compared to the single-core system?


c. [20] <1.7, 1.10> Now assume the voltage may not decrease below 30% of the
original voltage. This voltage is referred to as the "voltage floor," and any
voltage lower than that will lose the state. What percent of parallelization
gives you a voltage at the voltage floor?


d. [20] <1.7, 1.10> Using the equation in Section 1.5, how much dynamic
power would the dual-core system require from part (a) compared to the
single-core system when taking into account the voltage floor?
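
The dynamic power relation these parts rely on (Power_dynamic = 1/2 x Capacitive load x Voltage² x Frequency switched) can be sketched as a relative-power calculation. The function name, the `volt_floor` parameter, and the four-core example are assumptions made for illustration; the sketch also assumes the capacitive load per core is unchanged.

```python
def relative_dynamic_power(cores: int, freq_scale: float,
                           volt_scale: float, volt_floor: float = 0.0) -> float:
    """Dynamic power relative to one core at full voltage and frequency.

    Uses Power_dynamic = 1/2 * C * V^2 * f with the capacitive load per core
    held constant. volt_floor is the smallest allowed voltage scaling factor.
    """
    effective_volt_scale = max(volt_scale, volt_floor)
    return cores * effective_volt_scale ** 2 * freq_scale


# Hypothetical example: four cores, each at one quarter of the original
# frequency, with voltage scaled linearly with frequency.
no_floor = relative_dynamic_power(4, 0.25, 0.25)
with_floor = relative_dynamic_power(4, 0.25, 0.25, volt_floor=0.4)
print(f"relative power without a voltage floor: {no_floor:.4f}")
print(f"relative power with a 40% voltage floor: {with_floor:.4f}")
```

Because voltage enters squared, a voltage floor limits how much of the saving from lower frequency can actually be realized.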
Case Study 3: The Cost of Reliability (and Failure) in Web Servers


Concepts illustrated by this case study 


■ TPC-C
■ Reliability of Web Servers
■ MTTF


This set of exercises deals with the cost of not having reliable Web servers. The 
data is in two sets: one gives various statistics for Gap.com, which was down for 
maintenance for two weeks in 2005 [AP 2005]. The other is for Amazon.com, 
which was not down, but has better statistics on high-load sales days. The exer-
cises combine the two data sets and require estimating the economic cost of the
shutdown.


[10/10/20/20] <1.2, 1.9> On August 24, 2005, three Web sites managed by the 
Gap (Gap.com, OldNavy.com, and BananaRepublic.com) were taken down for
improvements [AP 2005]. These sites were virtually inaccessible for the next two 
weeks. Using the statistics in Figure 1.25, answer the following questions, which 
are based in part on hypothetical assumptions. 





a. [10] <1.2> In the third quarter of 2005, the Gap's revenue was $3.9 billion 
[Gap 2005]. The Web site returned live on September 7, 2005 [Internet 
Retailer 2005]. Assume that online sales total $1.4 million per day, and that 
everything else remains constant. What would the Gap's estimated revenue be
for the third quarter of 2005?


b. [10] <1.2> If this downtime occurred in the fourth quarter, what would you 
estimate the cost of the downtime to be? 



































Company  Time period   Amount         Type
Gap      3rd qtr 2004  $4 billion     Sales
         4th qtr 2004  $4.9 billion   Sales
         3rd qtr 2005  $3.9 billion   Sales
         4th qtr 2005  $4.8 billion   Sales
         3rd qtr 2004  $107 million   Online sales
         3rd qtr 2005  $106 million   Online sales
Amazon   3rd qtr 2005  $1.86 billion  Sales
         4th qtr 2005  $2.98 billion  Sales
         4th qtr 2005  108 million    Items sold
         Dec 12, 2005  3.6 million    Items sold





Figure 1.25 Statistics on sales for Gap and Amazon. Data compiled from AP [2005], 
Internet Retailer [2005], Gamasutra [2005], Seattle PI [2005], MSN Money [2005], Gap 
[2005], and Gap [2006]. 
c. [20] <1.2> When the site returned, the number of users allowed to visit the
site at one time was limited. Imagine that it was limited to 50% of the cus- 
tomers who wanted to access the site. Assume that each server costs $7500 to 
purchase and set up. How many servers, per day, could they purchase and 
install with the money they are losing in sales? 


d. [20] <1.2, 1.9> Gap.com had 2.6 million visitors in July 2004 [AP 2005]. On 
average, a user views 8.4 pages per day on Gap.com. Assume that the high- 
end servers at Gap.com are running SQLServer software, with a TPC-C
benchmark estimated cost of $5.38 per transaction. How much would it cost
for them to support their online traffic at Gap.com?

[10/10] <1.8> The main reliability measure is MTTF. We will now look at differ-
ent systems and how design decisions affect their reliability. Refer to Figure 1.25
for company statistics.

a. [10] <1.8> We have a single processor with an FIT of 100. What is the MTTF 
for this system? 

b. [10] <1.8> If it takes 1 day to get the system running again, what is the
availability of the system?
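
The two relations this exercise uses can be sketched as follows: FIT counts failures per billion hours of operation, so MTTF = 10^9 / FIT hours, and availability = MTTF / (MTTF + MTTR). The function names and the example values (a FIT of 2000 and a 24-hour repair time) are hypothetical and differ from the exercise on purpose.

```python
def mttf_from_fit(fit: float) -> float:
    """MTTF in hours from a FIT rate (failures per billion hours)."""
    return 1e9 / fit


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: FIT of 2000 and a 24-hour repair time.
mttf = mttf_from_fit(2000)
avail = availability(mttf, 24)
print(f"MTTF: {mttf:,.0f} hours")
print(f"availability: {avail:.6f}")
```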

[20] <1.8> Imagine that the government, to cut costs, is going to build a super-
computer out of the cheap processor system in Exercise 19 rather than a special-
purpose reliable system. What is the MTTF for a system with 1000 processors?
Assume that if one fails, they all fail.

[20/20] <1.2, 1.8> In a server farm such as that used by Amazon or the Gap, a
single failure does not cause the whole system to crash. Instead, it will reduce the
number of requests that can be satisfied at any one time.

a. [20] <1.8> If a company has 10,000 computers, and it experiences cata- 
strophic failure only if 1⁄3 of the computers fail, what is the MTTF for the 
system? 

b. [20] <1.2, 1.8> If it costs an extra $1000, per computer, to double the MTTF,
would this be a good business decision? Show your work. 


Case Study 4: Performance 


Concepts illustrated by this case study 


■ Arithmetic Mean
■ Geometric Mean
■ Parallelism
■ Amdahl's Law
■ Weighted Averages
In this set of exercises, you are to make sense of Figure 1.26, which presents the
performance of selected processors and a fictional one (Processor X), as reported 
by www.tomshardware.com. For each system, two benchmarks were run. One 
benchmark exercised the memory hierarchy, giving an indication of the speed of 
the memory for that system. The other benchmark, Dhrystone, is a CPU-intensive 
benchmark that does not exercise the memory system. Both benchmarks are dis- 
played in order to distill the effects that different design decisions have on mem- 
ory and CPU performance. 





[10/10/Discussion/10/20/Discussion] <1.7> Make the following calculations on 
the raw data in order to explore how different measures color the conclusions one 
can make. (Doing these exercises will be much easier using a spreadsheet.) 


a. [10] <1.8> Create a table similar to that shown in Figure 1.26, except express 
the results as normalized to the Pentium D for each benchmark. 


b. [10] <1.9> Calculate the arithmetic mean of the performance of each proces- 
sor. Use both the original performance and your normalized performance cal- 
culated in part (a). 


c. [Discussion] <1.9> Given your answer from part (b), can you draw any con- 
flicting conclusions about the relative performance of the different proces- 
sors? 


d. [10] <1.9> Calculate the geometric mean of the normalized performance of 
the dual processors and the geometric mean of the normalized performance 
of the single processors for the Dhrystone benchmark. 


e. [20] <1.9> Plot a 2D scatter plot with the x-axis being Dhrystone and the y- 
axis being the memory benchmark. 


f. [Discussion] <1.9> Given your plot in part (e), in what area does a dual- 
processor gain in performance? Explain, given your knowledge of parallel 
processing and architecture, why these results are as they are. 
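
The difference between the arithmetic and geometric means of normalized results, which parts (b) through (d) probe, can be sketched with made-up data. The two machines and their scores below are hypothetical and are not taken from Figure 1.26; the helper names are assumptions for illustration.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical scores (higher is better) on two benchmarks.
machine_a = [10.0, 1000.0]
machine_b = [100.0, 100.0]

# Normalize each machine to the other, benchmark by benchmark.
b_relative_to_a = [b / a for a, b in zip(machine_a, machine_b)]   # [10.0, 0.1]
a_relative_to_b = [a / b for a, b in zip(machine_a, machine_b)]   # [0.1, 10.0]

print(arithmetic_mean(b_relative_to_a), arithmetic_mean(a_relative_to_b))
print(geometric_mean(b_relative_to_a), geometric_mean(a_relative_to_b))
```

With these numbers the arithmetic mean of the normalized scores makes each machine look about five times faster than the other, depending on which one is the reference, while the geometric mean reports them as equal regardless of the reference.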





























Chip                 # of cores  Clock frequency (MHz)  Memory performance  Dhrystone performance
Athlon 64 X2 4800+   2           2,400                  3,423               20,718
Pentium EE 840       2           2,200                  3,228               18,893
Pentium D 820        2           3,000                  3,000               15,220
Athlon 64 X2 3800+   2           3,200                  2,941               17,129
Pentium 4            1           2,800                  2,731               7,621
Athlon 64 3000+      1           1,800                  2,953               7,628
Pentium 4 570        1           2,800                  3,501               11,210
Processor X          1           3,000                  7,000               5,000





Figure 1.26 Performance of several processors on two benchmarks. 
[10/10/20] <1.9> Imagine that your company is trying to decide between a
single-processor system and a dual-processor system. Figure 1.26 gives the per- 
formance on two sets of benchmarks—a memory benchmark and a processor 
benchmark. You know that your application will spend 40% of its time on 
memory-centric computations, and 60% of its time on processor-centric compu- 
tations. 


a. [10] <1.9> Calculate the weighted execution time of the benchmarks. 


b. [10] <1.9> How much speedup do you anticipate getting if you move from 
using a Pentium 4 570 to an Athlon 64 X2 4800+ on a CPU-intensive applica- 
tion suite? 

c. [20] <1.9> At what ratio of memory to processor computation would the per- 
formance of the Pentium 4 570 be equal to the Pentium D 820? 


[10/10/20/20] <1.10> Your company has just bought a new dual Pentium proces-
sor, and you have been tasked with optimizing your software for this processor.
You will run two applications on this dual Pentium, but the resource requirements
are not equal. The first application needs 80% of the resources, and the other only 
20% of the resources. 


a. [10] <1.10> Given that 40% of the first application is parallelizable, how 
much speedup would you achieve with that application if run in isolation?


b. [10] <1.10> Given that 99% of the second application is parallelizable, how 
much speedup would this application observe if run in isolation? 


c. [20] <1.10> Given that 40% of the first application is parallelizable, how 
much overall system speedup would you observe if you parallelized it? 


d. [20] <1.10> Given that 99% of the second application is parallelizable, how 
much overall system speedup would you get? 
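
Amdahl's Law, which these parts apply, can be sketched as a one-line function: overall speedup = 1 / ((1 - fraction enhanced) + fraction enhanced / speedup enhanced). The function name and the example fractions are assumptions for illustration and are deliberately different from the values in the exercise.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical example: half of the program is parallelizable.
on_four_processors = amdahl_speedup(0.5, 4)
on_many_processors = amdahl_speedup(0.5, 1_000_000)
print(f"4 processors: {on_four_processors:.2f}x")
print(f"very many processors: {on_many_processors:.2f}x")
```

Even with an effectively unlimited number of processors, the overall speedup in this example cannot exceed 2, because the serial half of the program is untouched.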
Instruction-Level Parallelism and Its
Exploitation


"Who's first?" 
"America." 
"Who's second?" 


"Sir, there is no second." 
Dialog between two observers
of the sailing race later named
"The America's Cup" and run
every few years, the inspiration
for John Cocke's naming of
the IBM research processor as
"America." This processor was
the precursor to the RS/6000
series and the first superscalar
microprocessor.
Chapter Two Instruction-Level Parallelism and Its Exploitation


2.1 Instruction-Level Parallelism: Concepts and Challenges


All processors since about 1985 use pipelining to overlap the execution of 
instructions and improve performance. This potential overlap among instructions 
is called instruction-level parallelism (ILP), since the instructions can be evalu- 
ated in parallel. In this chapter and Appendix G, we look at a wide range of tech- 
niques for extending the basic pipelining concepts by increasing the amount of 
parallelism exploited among instructions. 

This chapter is at a considerably more advanced level than the material on 
basic pipelining in Appendix A. If you are not familiar with the ideas in Appendix 
A, you should review that appendix before venturing into this chapter. 

We start this chapter by looking at the limitation imposed by data and control 
hazards and then turn to the topic of increasing the ability of the compiler and the 
processor to exploit parallelism. These sections introduce a large number of con- 
cepts, which we build on throughout this chapter and the next. While some of the 
more basic material in this chapter could be understood without all of the ideas in 
the first two sections, this basic material is important to later sections of this 
chapter as well as to Chapter 3. 

There are two largely separable approaches to exploiting ILP: an approach 
that relies on hardware to help discover and exploit the parallelism dynamically, 
and an approach that relies on software technology to find parallelism, statically 
at compile time. Processors using the dynamic, hardware-based approach, 
including the Intel Pentium series, dominate in the market; those using the static 
approach, including the Intel Itanium, have more limited uses in scientific or 
application-specific environments. 

In the past few years, many of the techniques developed for one approach 
have been exploited within a design relying primarily on the other. This chapter 
introduces the basic concepts and both approaches. The next chapter focuses on 
the critical issue of limitations on exploiting ILP. 

In this section, we discuss features of both programs and processors that limit 
the amount of parallelism that can be exploited among instructions, as well as the 
critical mapping between program structure and hardware structure, which is key 
to understanding whether a program property will actually limit performance and 
under what circumstances. 

The value of the CPI (cycles per instruction) for a pipelined processor is the 
sum of the base CPI and all contributions from stalls: 


Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls 


The ideal pipeline CPI is a measure of the maximum performance attainable by
the implementation. By reducing each of the terms of the right-hand side, we
minimize the overall pipeline CPI or, alternatively, increase the IPC (instructions
per clock). The equation above allows us to characterize various techniques by
what component of the overall CPI a technique reduces. Figure 2.1 shows the
techniques we examine in this chapter and in Appendix G, as well as the topics
covered in the introductory material in Appendix A. In this chapter we will see
that the techniques we introduce to decrease the ideal pipeline CPI can increase
the importance of dealing with hazards.


Technique                                       Reduces                                              Section
Forwarding and bypassing                        Potential data hazard stalls                         A.2
Delayed branches and simple branch scheduling   Control hazard stalls                                A.2
Basic dynamic scheduling (scoreboarding)        Data hazard stalls from true dependences             A.7
Dynamic scheduling with renaming                Data hazard stalls and stalls from antidependences   2.4
                                                and output dependences
Branch prediction                               Control stalls                                       2.3
Issuing multiple instructions per cycle         Ideal CPI                                            2.7, 2.8
Hardware speculation                            Data hazard and control hazard stalls                2.6
Dynamic memory disambiguation                   Data hazard stalls with memory                       2.4, 2.6
Loop unrolling                                  Control hazard stalls                                2.2
Basic compiler pipeline scheduling              Data hazard stalls                                   A.2, 2.2
Compiler dependence analysis, software          Ideal CPI, data hazard stalls                        G.2, G.3
pipelining, trace scheduling
Hardware support for compiler speculation       Ideal CPI, data hazard stalls, branch hazard stalls  G.4, G.5


Figure 2.1 The major techniques examined in Appendix A, Chapter 2, or Appendix G are shown together with
the component of the CPI equation that the technique affects.
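
To make the pipeline CPI equation concrete, here is a minimal Python sketch. The function name and the stall values are assumptions chosen for illustration only; they are not measurements of any real processor. Each stall term is taken to be an average number of stall cycles per instruction.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.30, control_stalls=0.15)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI

print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")  # prints: CPI = 1.50, IPC = 0.67
```

Lowering any one of the stall arguments lowers the CPI, which is how the techniques in Figure 2.1 are classified.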


What Is Instruction-Level Parallelism? 


All the techniques in this chapter exploit parallelism among instructions. The 
amount of parallelism available within a basic block—a straight-line code 
sequence with no branches in except to the entry and no branches out except at 
the exit—is quite small. For typical MIPS programs, the average dynamic 
branch frequency is often between 15% and 25%, meaning that between three 
and six instructions execute between a pair of branches. Since these instructions 
are likely to depend upon one another, the amount of overlap we can exploit 
within a basic block is likely to be less than the average basic block size. To
obtain substantial performance enhancements, we must exploit ILP across mul- 
tiple basic blocks. 
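
The "three to six instructions" estimate follows from simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes that the count excludes the branch itself: one branch every 1/f instructions leaves 1/f - 1 non-branch instructions between branches.

```python
def instructions_between_branches(branch_frequency):
    """Average number of non-branch instructions between consecutive branches.

    Assumes branches are a fixed fraction of all dynamic instructions.
    """
    return 1.0 / branch_frequency - 1.0

for f in (0.25, 0.15):
    n = instructions_between_branches(f)
    print(f"branch frequency {f:.0%}: about {n:.1f} instructions between branches")
```

At 25% the result is 3 instructions; at 15% it is a little under 6.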

The simplest and most common way to increase the ILP is to exploit parallel- 
ism among iterations of a loop. This type of parallelism is often called loop-level 
parallelism. Here is a simple example of a loop, which adds two 1000-element 
arrays, that is completely parallel:

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];


Every iteration of the loop can overlap with any other iteration, although within 
each loop iteration there is little or no opportunity for overlap. 

There are a number of techniques we will examine for converting such loop- 
level parallelism into instruction-level parallelism. Basically, such techniques 
work by unrolling the loop either statically by the compiler (as in the next sec- 
tion) or dynamically by the hardware (as in Sections 2.5 and 2.6). 

An important alternative method for exploiting loop-level parallelism is the 
use of vector instructions (see Appendix F). A vector instruction exploits data- 
level parallelism by operating on data items in parallel. For example, the above 
code sequence could execute in four instructions on some vector processors: two 
instructions to load the vectors x and y from memory, one instruction to add the 
two vectors, and an instruction to store back the result vector. Of course, these 
instructions would be pipelined and have relatively long latencies, but these 
latencies may be overlapped. 

Although the development of the vector ideas preceded many of the tech- 
niques for exploiting ILP, processors that exploit ILP have almost completely 
replaced vector-based processors in the general-purpose processor market. Vector 
instruction sets, however, have seen a renaissance, at least for use in graphics, 
digital signal processing, and multimedia applications. 


Data Dependences and Hazards 


Determining how one instruction depends on another is critical to determining 
how much parallelism exists in a program and how that parallelism can be 
exploited. In particular, to exploit instruction-level parallelism we must deter- 
mine which instructions can be executed in parallel. If two instructions are paral- 
lel, they can execute simultaneously in a pipeline of arbitrary depth without 
causing any stalls, assuming the pipeline has sufficient resources (and hence no 
structural hazards exist). If two instructions are dependent, they are not parallel 
and must be executed in order, although they may often be partially overlapped. 
The key in both cases is to determine whether an instruction is dependent on 
another instruction. 


Data Dependences 


There are three different types of dependences: data dependences (also called 
true data dependences), name dependences, and control dependences. An instruc- 
tion j is data dependent on instruction i if either of the following holds: 


• instruction i produces a result that may be used by instruction j, or


• instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.

The second condition simply states that one instruction is dependent on another if
there exists a chain of dependences of the first type between the two instructions.
This dependence chain can be as long as the entire program. Note that a dependence
within a single instruction (such as AND R1,R1,R1) is not considered a
dependence.

For example, consider the following MIPS code sequence that increments a
vector of values in memory (starting at 0(R1), and with the last element at
8(R2)), by a scalar in register F2. (For simplicity, throughout this chapter, our
examples ignore the effects of delayed branches.)


Loop: L.D    F0,0(R1)    ;F0=array element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer 8 bytes
      BNE    R1,R2,Loop  ;branch R1!=R2


The data dependences in this code sequence involve both floating-point data:


Loop: L.D    F0,0(R1)    ;F0=array element
      ADD.D  F4,F0,F2    ;add scalar in F2    <- uses F0 from L.D
      S.D    F4,0(R1)    ;store result        <- uses F4 from ADD.D


and integer data:


      DADDIU R1,R1,-8    ;decrement pointer
                         ;8 bytes (per DW)
      BNE    R1,R2,Loop  ;branch R1!=R2       <- uses R1 from DADDIU


Both of the above dependent sequences, as shown by the arrows, have each 
instruction depending on the previous one. The arrows here and in following 
examples show the order that must be preserved for correct execution. The arrow 
points from an instruction that must precede the instruction that the arrowhead 
points to. 

If two instructions are data dependent, they cannot execute simultaneously or 
be completely overlapped. The dependence implies that there would be a chain of 
one or more data hazards between the two instructions. (See Appendix A for a 
brief description of data hazards, which we will define precisely in a few pages.) 
Executing the instructions simultaneously will cause a processor with pipeline 
interlocks (and a pipeline depth longer than the distance between the instructions 
in cycles) to detect a hazard and stall, thereby reducing or eliminating the over- 
lap. In a processor without interlocks that relies on compiler scheduling, the com- 
piler cannot schedule dependent instructions in such a way that they completely 
overlap, since the program will not execute correctly. The presence of a data
dependence in an instruction sequence reflects a data dependence in the source
code from which the instruction sequence was generated. The effect of the origi- 
nal data dependence must be preserved. 

Dependences are a property of programs. Whether a given dependence results 
in an actual hazard being detected and whether that hazard actually causes a stall 
are properties of the pipeline organization. This difference is critical to under- 
standing how instruction-level parallelism can be exploited. 

A data dependence conveys three things: (1) the possibility of a hazard, (2) the 
order in which results must be calculated, and (3) an upper bound on how much 
parallelism can possibly be exploited. Such limits are explored in Chapter 3. 

Since a data dependence can limit the amount of instruction-level parallelism 
we can exploit, a major focus of this chapter is overcoming these limitations. A 
dependence can be overcome in two different ways: maintaining the dependence 
but avoiding a hazard, and eliminating a dependence by transforming the code. 
Scheduling the code is the primary method used to avoid a hazard without alter- 
ing a dependence, and such scheduling can be done both by the compiler and by 
the hardware. 

A data value may flow between instructions either through registers or 
through memory locations. When the data flow occurs in a register, detecting the 
dependence is straightforward since the register names are fixed in the instruc- 
tions, although it gets more complicated when branches intervene and correct- 
ness concerns force a compiler or hardware to be conservative. 

Dependences that flow through memory locations are more difficult to detect, 
since two addresses may refer to the same location but look different: For exam- 
ple, 100(R4) and 20(R6) may be identical memory addresses. In addition, the 
effective address of a load or store may change from one execution of the instruction
to another (so that 20(R4) and 20(R4) may be different), further complicating
the detection of a dependence.
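
The two difficulties just described can be sketched in a few lines of Python. The register contents below are hypothetical values chosen so that the operands alias, and `effective_address` is an illustrative helper, not part of any real tool.

```python
def effective_address(offset, base, regs):
    """Effective address of a MIPS-style memory operand such as 100(R4)."""
    return offset + regs[base]

# Hypothetical register contents chosen so that 100(R4) and 20(R6) alias.
regs = {"R4": 1000, "R6": 1080}
same_location = effective_address(100, "R4", regs) == effective_address(20, "R6", regs)

# The same operand text, 20(R4), evaluated before and after R4 is updated.
before = effective_address(20, "R4", regs)
regs["R4"] -= 8
after = effective_address(20, "R4", regs)

print(same_location, before, after)  # prints: True 1020 1012
```

Operands that look different can name the same location, and operands that look identical can name different locations, so comparing the operand text alone is not enough.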

In this chapter, we examine hardware for detecting data dependences that 
involve memory locations, but we will see that these techniques also have limita- 
tions. The compiler techniques for detecting such dependences are critical in 
uncovering loop-level parallelism, as we will see in Appendix G. 


Name Dependences 


The second type of dependence is a name dependence. A name dependence 
occurs when two instructions use the same register or memory location, called a 
name, but there is no flow of data between the instructions associated with that 
name. There are two types of name dependences between an instruction i that
precedes instruction j in program order: 


1. An antidependence between instruction i and instruction j occurs when
instruction j writes a register or memory location that instruction i reads. The
original ordering must be preserved to ensure that i reads the correct value. In
the earlier loop example (page 69), there is an antidependence between S.D and
DADDIU on register R1.

2. An output dependence occurs when instruction i and instruction j write the
same register or memory location. The ordering between the instructions
must be preserved to ensure that the value finally written corresponds to
instruction j.


Both antidependences and output dependences are name dependences, as 
opposed to true data dependences, since there is no value being transmitted 
between the instructions. Since a name dependence is not a true dependence, 
instructions involved in a name dependence can execute simultaneously or be 
reordered, if the name (register number or memory location) used in the instruc- 
tions is changed so the instructions do not conflict. 

This renaming can be more easily done for register operands, where it is 
called register renaming. Register renaming can be done either statically by a 
compiler or dynamically by the hardware. Before describing dependences arising 
from branches, let's examine the relationship between dependences and pipeline 
data hazards. 
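
As an illustration of register renaming, the following Python sketch gives every destination a fresh name and redirects later reads to the newest name. The four-instruction sequence, the tuple encoding, and the temporary names T0, T1, ... are all hypothetical; a real compiler or processor has a finite pool of registers and must also handle memory operands.

```python
def rename(instrs):
    """Rename every destination to a fresh temporary.

    instrs is a list of (opcode, destination, [sources]) tuples.
    Sources are redirected to the most recent name of the register they read,
    so true data dependences are kept while name dependences disappear.
    """
    latest = {}   # architectural register -> current temporary name
    renamed = []
    for count, (op, dest, srcs) in enumerate(instrs):
        new_srcs = [latest.get(s, s) for s in srcs]  # read the newest value
        new_dest = f"T{count}"                       # fresh name per write
        latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

# Hypothetical sequence: ADD.D reads F8 before SUB.D writes it (antidependence),
# and ADD.D and MUL.D both write F6 (output dependence).
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

for op, dest, srcs in rename(program):
    print(op, dest, ",".join(srcs))
```

After renaming, ADD.D still reads the result of DIV.D and MUL.D still reads the result of SUB.D, but no two instructions write the same name and no instruction writes a name that an earlier one reads.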


Data Hazards 


A hazard is created whenever there is a dependence between instructions, and 
they are close enough that the overlap during execution would change the order 
of access to the operand involved in the dependence. Because of the dependence, 
we must preserve what is called program order, that is, the order that the instruc- 
tions would execute in if executed sequentially one at a time as determined by the 
original source program. The goal of both our software and hardware techniques 
is to exploit parallelism by preserving program order only where it affects the out- 
come of the program. Detecting and avoiding hazards ensures that necessary pro- 
gram order is preserved. 

Data hazards, which are informally described in Appendix A, may be classi- 
fied as one of three types, depending on the order of read and write accesses in 
the instructions. By convention, the hazards are named by the ordering in the pro- 
gram that must be preserved by the pipeline. Consider two instructions i and j, 
with i preceding j in program order. The possible data hazards are 


• RAW (read after write): j tries to read a source before i writes it, so j
incorrectly gets the old value. This hazard is the most common type and
corresponds to a true data dependence. Program order must be preserved to ensure
that j receives the value from i.


• WAW (write after write): j tries to write an operand before it is written by i.
The writes end up being performed in the wrong order, leaving the value written
by i rather than the value written by j in the destination. This hazard
corresponds to an output dependence. WAW hazards are present only in pipelines
that write in more than one pipe stage or allow an instruction to proceed even
when a previous instruction is stalled.


• WAR (write after read): j tries to write a destination before it is read by i, so
i incorrectly gets the new value. This hazard arises from an antidependence.
WAR hazards cannot occur in most static issue pipelines, even deeper pipelines
or floating-point pipelines, because all reads are early (in ID) and all
writes are late (in WB). (See Appendix A to convince yourself.) A WAR hazard
occurs either when there are some instructions that write results early in
the instruction pipeline and other instructions that read a source late in the
pipeline, or when instructions are reordered, as we will see in this chapter.


Note that the RAR (read after read) case is not a hazard. 
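
The classification above can be sketched as a small Python function. The (destination, sources) encoding is an assumption made for illustration, and only register names are compared. The function reports which hazards are possible between a pair of instructions; whether one actually occurs depends on the pipeline, as discussed earlier.

```python
def possible_hazards(i, j):
    """Return the set of hazard types possible between i and j.

    i precedes j in program order. Each instruction is (destination, sources),
    where destination is a register name or None and sources is a set of names.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")   # j reads what i writes: true data dependence
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")   # j writes what i reads: antidependence
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")   # both write the same name: output dependence
    return hazards

load  = ("F0", {"R1"})        # L.D    F0,0(R1)
add   = ("F4", {"F0", "F2"})  # ADD.D  F4,F0,F2
store = (None, {"F4", "R1"})  # S.D    F4,0(R1)
dec   = ("R1", {"R1"})        # DADDUI R1,R1,#-8

print(possible_hazards(load, add))    # prints: {'RAW'}
print(possible_hazards(store, dec))   # prints: {'WAR'}
print(possible_hazards(load, store))  # prints: set()
```

The last call shows the RAR case: both instructions only read R1, so no hazard is reported.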


Control Dependences 


The last type of dependence is a control dependence. A control dependence deter- 
mines the ordering of an instruction, i, with respect to a branch instruction so that 
the instruction i is executed in correct program order and only when it should be. 
Every instruction, except for those in the first basic block of the program, is con- 
trol dependent on some set of branches, and, in general, these control depen- 
dences must be preserved to preserve program order. One of the simplest 
examples of a control dependence is the dependence of the statements in the
"then" part of an if statement on the branch. For example, in the code segment


    if p1 {
        S1;
    };
    if p2 {
        S2;
    }


S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
In general, there are two constraints imposed by control dependences: 


1. An instruction that is control dependent on a branch cannot be moved before 
the branch so that its execution is no longer controlled by the branch. For 
example, we cannot take an instruction from the then portion of an if state- 
ment and move it before the if statement. 


2. An instruction that is not control dependent on a branch cannot be moved 
after the branch so that its execution is controlled by the branch. For example, 
we cannot take a statement before the if statement and move it into the then 
portion. 


When processors preserve strict program order, they ensure that control
dependences are also preserved. We may be willing to execute instructions that
should not have been executed, however, thereby violating the control
dependences, if we can do so without affecting the correctness of the program.
Control dependence is not the critical property that must be preserved. Instead, the
two properties critical to program correctness (and normally preserved by
maintaining both data and control dependence) are the exception behavior and
the data flow.

Preserving the exception behavior means that any changes in the ordering of 
instruction execution must not change how exceptions are raised in the program. 
Often this is relaxed to mean that the reordering of instruction execution must not 
cause any new exceptions in the program. A simple example shows how main- 
taining the control and data dependences can prevent such situations. Consider 
this code sequence: 


      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,0(R2)
L1:


In this case, it is easy to see that if we do not maintain the data dependence 
involving R2, we can change the result of the program. Less obvious is the fact 
that if we ignore the control dependence and move the load instruction before the 
branch, the load instruction may cause a memory protection exception. Notice 
that no data dependence prevents us from interchanging the BEQZ and the LW; it is
only the control dependence. To allow us to reorder these instructions (and still 
preserve the data dependence), we would like to just ignore the exception when 
the branch is taken. In Section 2.6, we will look at a hardware technique, specula- 
tion, which allows us to overcome this exception problem. Appendix G looks at 
software techniques for supporting speculation. 

The second property preserved by maintenance of data dependences and control
dependences is the data flow. The data flow is the actual flow of data values
among instructions that produce results and those that consume them. Branches 
make the data flow dynamic, since they allow the source of data for a given 
instruction to come from many points. Put another way, it is insufficient to just 
maintain data dependences because an instruction may be data dependent on 
more than one predecessor. Program order is what determines which predecessor 
will actually deliver a data value to an instruction. Program order is ensured by 
maintaining the control dependences. 

For example, consider the following code fragment: 


      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
L:    ...
      OR    R7,R1,R8


In this example, the value of R1 used by the OR instruction depends on whether
the branch is taken or not. Data dependence alone is not sufficient to preserve
correctness. The OR instruction is data dependent on both the DADDU and DSUBU
instructions, but preserving that order alone is insufficient for correct execution.
Instead, when the instructions execute, the data flow must be preserved: If the
branch is not taken, then the value of R1 computed by the DSUBU should be used
by the OR, and if the branch is taken, the value of R1 computed by the DADDU
should be used by the OR. By preserving the control dependence of the OR on the
branch, we prevent an illegal change to the data flow. For similar reasons, the
DSUBU instruction cannot be moved above the branch. Speculation, which helps
with the exception problem, will also allow us to lessen the impact of the control
dependence while still maintaining the data flow, as we will see in Section 2.6.

Sometimes we can determine that violating the control dependence cannot 
affect either the exception behavior or the data flow. Consider the following code 
sequence: 


      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9


Suppose we knew that the register destination of the DSUBU instruction (R4) was 
unused after the instruction labeled skip. (The property of whether a value will 
be used by an upcoming instruction is called liveness.) If R4 were unused, then 
changing the value of R4 just before the branch would not affect the data flow 
since R4 would be dead (rather than live) in the code region after skip. Thus, if
R4 were dead and the existing DSUBU instruction could not generate an exception
(other than those from which the processor resumes the same process), we could
move the DSUBU instruction before the branch, since the data flow cannot be
affected by this change. 
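
The liveness test used in this argument can be sketched for the simplest case, a straight-line region with no further branches. The Python function below and its (destination, sources) encoding are illustrative assumptions. It also assumes that nothing beyond the listed instructions reads the register; a real compiler would have to follow every path through the program.

```python
def is_live(reg, instrs):
    """True if reg is read before it is overwritten in a straight-line region.

    instrs is a list of (destination, sources) pairs in program order.
    Assumption: no code after the listed instructions reads reg.
    """
    for dest, srcs in instrs:
        if reg in srcs:
            return True    # the old value is still needed
        if dest == reg:
            return False   # overwritten before any use, so the old value is dead
    return False

# Region starting at the label skip in the example: OR R7,R8,R9
after_skip = [("R7", ("R8", "R9"))]

print(is_live("R4", after_skip))  # prints: False
print(is_live("R9", after_skip))  # prints: True
```

Under these assumptions R4 is dead after skip, which is the condition the text requires before the DSUBU can be moved above the branch.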

If the branch is taken, the DSUBU instruction will execute and will be useless, 
but it will not affect the program results. This type of code scheduling is also a 
form of speculation, often called software speculation, since the compiler is bet- 
ting on the branch outcome; in this case, the bet is that the branch is usually not 
taken. More ambitious compiler speculation mechanisms are discussed in 
Appendix G. Normally, it will be clear when we say speculation or speculative 
whether the mechanism is a hardware or software mechanism; when it is not 
clear, it is best to say "hardware speculation" or "software speculation." 

Control dependence is preserved by implementing control hazard detection 
that causes control stalls. Control stalls can be eliminated or reduced by a variety 
of hardware and software techniques, which we examine in Section 2.3. 


2.2 Basic Compiler Techniques for Exposing ILP 


This section examines the use of simple compiler technology to enhance a pro- 
cessor's ability to exploit ILP. These techniques are crucial for processors that 
use static issue or static scheduling. Armed with this compiler technology, we 
will shortly examine the design and performance of processors using static issuing.
Appendix G will investigate more sophisticated compiler and associated
hardware schemes designed to enable a processor to exploit more instruction- 
level parallelism. 


Basic Pipeline Scheduling and Loop Unrolling 


To keep a pipeline full, parallelism among instructions must be exploited by find- 
ing sequences of unrelated instructions that can be overlapped in the pipeline. To 
avoid a pipeline stall, a dependent instruction must be separated from the source 
instruction by a distance in clock cycles equal to the pipeline latency of that 
source instruction. A compiler's ability to perform this scheduling depends both 
on the amount of ILP available in the program and on the latencies of the 
functional units in the pipeline. Figure 2.2 shows the FP unit latencies we assume 
in this chapter, unless different latencies are explicitly stated. We assume the 
standard five-stage integer pipeline, so that branches have a delay of 1 clock 
cycle. We assume that the functional units are fully pipelined or replicated (as 
many times as the pipeline depth), so that an operation of any type can be issued 
on every clock cycle and there are no structural hazards. 

In this subsection, we look at how the compiler can increase the amount of 
available ILP by transforming loops. This example serves both to illustrate an 
important technique as well as to motivate the more powerful program transfor- 
mations described in Appendix G. We will rely on the following code segment, 
which adds a scalar to a vector: 


    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;


We can see that this loop is parallel by noticing that the body of each iteration is 
independent. We will formalize this notion in Appendix G and describe how we 
can test whether loop iterations are independent at compile time. First, let's look 
at the performance of this loop, showing how we can use the parallelism to 
improve its performance for a MIPS pipeline with the latencies shown in Figure 2.2.

















Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0


Figure 2.2 Latencies of FP operations used in this chapter. The last column is the
number of intervening clock cycles needed to avoid a stall. These numbers are similar
to the average latencies we would see on an FP unit. The latency of a floating-point load
to a store is 0, since the result of the load can be bypassed without stalling the store. We
will continue to assume an integer load latency of 1 and an integer ALU operation
latency of 0.

The first step is to translate the above segment to MIPS assembly language. In
the following code segment, R1 is initially the address of the element in the array
with the highest address, and F2 contains the scalar value s. Register R2 is
precomputed, so that 8(R2) is the address of the last element to operate on.

The straightforward MIPS code, not scheduled for the pipeline, looks like
this:


Loop: L.D    F0,0(R1)    ;F0=array element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer
                         ;8 bytes (per DW)
      BNE    R1,R2,Loop  ;branch R1!=R2


Let's start by seeing how well this loop will run when it is scheduled on a 
simple pipeline for MIPS with the latencies from Figure 2.2. 


Example Show how the loop would look on MIPS, both scheduled and unscheduled, 
including any stalls or idle clock cycles. Schedule for delays from floating-point 
operations, but remember that we are ignoring delayed branches. 


Answer Without any scheduling, the loop will execute as follows, taking 9 cycles:

                               Clock cycle issued
Loop: L.D    F0,0(R1)          1
      stall                    2
      ADD.D  F4,F0,F2          3
      stall                    4
      stall                    5
      S.D    F4,0(R1)          6
      DADDUI R1,R1,#-8         7
      stall                    8
      BNE    R1,R2,Loop        9


We can schedule the loop to obtain only two stalls and reduce the time to 7
cycles:

Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop


The stalls after ADD.D are for use by the S.D.
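
The cycle counts in this example can be reproduced with a small Python model, offered as a sketch under stated assumptions rather than as a pipeline simulator. It assumes a single-issue, in-order pipeline with no structural hazards and looks at one loop iteration in isolation. The instruction kinds ("load", "fpalu", "store", "intalu", "branch") are names invented here. The FP latencies come from Figure 2.2; the one-cycle latency from an integer ALU operation to a branch is an assumption that matches the stall shown before BNE.

```python
# (producer kind, consumer kind) -> intervening cycles needed to avoid a stall
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # assumed, to match the stall before BNE
}

def issue_cycles(instrs):
    """Return the clock cycle in which each instruction issues.

    instrs is a list of (kind, destination, sources) in program order.
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle + 1  # in-order issue, one instruction per cycle
        for src in srcs:
            if src in last_writer:
                p_cycle, p_kind = last_writer[src]
                needed = p_cycle + 1 + LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, needed)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycles

unscheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("fpalu",  "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4,0(R1)
    ("intalu", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]
scheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("intalu", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("fpalu",  "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4,8(R1)
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]

print(issue_cycles(unscheduled))  # prints: [1, 3, 6, 7, 9]
print(issue_cycles(scheduled))    # prints: [1, 2, 3, 6, 7]
```

The last entry of each list is the cycle in which BNE issues: 9 for the unscheduled loop and 7 for the scheduled one, matching the tables above.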


In the previous example, we complete one loop iteration and store back one
array element every 7 clock cycles, but the actual work of operating on the array
element takes just 3 (the load, add, and store) of those 7 clock cycles. The
remaining 4 clock cycles consist of loop overhead (the DADDUI and BNE) and
two stalls. To eliminate these 4 clock cycles we need to get more operations
relative to the number of overhead instructions.

A simple scheme for increasing the number of instructions relative to the
branch and overhead instructions is loop unrolling. Unrolling simply replicates
the loop body multiple times, adjusting the loop termination code.

Loop unrolling can also be used to improve scheduling. Because it eliminates
the branch, it allows instructions from different iterations to be scheduled
together. In this case, we can eliminate the data use stalls by creating additional
independent instructions within the loop body. If we simply replicated the
instructions when we unrolled the loop, the resulting use of the same registers
could prevent us from effectively scheduling the loop. Thus, we will want to use
different registers for each iteration, increasing the required number of registers.


Example Show our loop unrolled so that there are four copies of the loop body, assuming
R1 - R2 (that is, the size of the array) is initially a multiple of 32, which means
that the number of loop iterations is a multiple of 4. Eliminate any obviously
redundant computations and do not reuse any of the registers.


Answer Here is the result after merging the DADDUI instructions and dropping the
unnecessary BNE operations that are duplicated during unrolling. Note that R2 must now
be set so that 32(R2) is the starting address of the last four elements.


Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop


We have eliminated three branches and three decrements of R1. The addresses on
the loads and stores have been compensated to allow the DADDUI instructions on
R1 to be merged. This optimization may seem trivial, but it is not; it requires
symbolic substitution and simplification. Symbolic substitution and simplification
will rearrange expressions so as to allow constants to be collapsed, allowing an
expression such as "((i + 1) + 1)" to be rewritten as "(i + (1 + 1))" and then
simplified to "(i + 2)." We will see more general forms of these optimizations that
eliminate dependent computations in Appendix G.

Without scheduling, every operation in the unrolled loop is followed by
a dependent operation and thus will cause a stall. This loop will run in 27 clock
cycles (each L.D has 1 stall, each ADD.D 2, the DADDUI 1, plus 14 instruction issue
cycles), or 6.75 clock cycles for each of the four elements, but it can be scheduled
to improve performance significantly. Loop unrolling is normally done early
in the compilation process, so that redundant computations can be exposed and
eliminated by the optimizer.
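
The 27-cycle figure is a direct sum, spelled out in the short Python sketch below. The variable names are illustrative; the stall counts are the ones stated in the text.

```python
COPIES = 4                        # loop bodies in the unrolled loop
instructions = COPIES * 3 + 2     # 4 x (L.D, ADD.D, S.D) + DADDUI + BNE = 14
stalls = COPIES * 1 + COPIES * 2 + 1  # 1 per L.D, 2 per ADD.D, 1 after DADDUI

total_cycles = instructions + stalls
cycles_per_element = total_cycles / COPIES

print(total_cycles, cycles_per_element)  # prints: 27 6.75
```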


In real programs we do not usually know the upper bound on the loop. Suppose
it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops. The
first executes (n mod k) times and has a body that is the original loop. The second
is the unrolled body surrounded by an outer loop that iterates (n/k) times. For
large values of n, most of the execution time will be spent in the unrolled loop
body.
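
The pair of loops can be sketched in Python for the scalar-add example with k = 4. The function name and the list-based array are assumptions for illustration, and n/k is taken to mean integer division. A compiler would perform this transformation on the machine-level loop, not on source code like this.

```python
K = 4  # unrolling factor; the second loop body below is written out for K = 4

def add_scalar_unrolled(x, s):
    """Add s to every element of x in place, using a remainder loop
    followed by a loop unrolled K times."""
    n = len(x)
    i = 0
    # First loop: the original body, executed (n mod K) times.
    for _ in range(n % K):
        x[i] = x[i] + s
        i += 1
    # Second loop: K copies of the body, executed (n // K) times.
    for _ in range(n // K):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += K
    return x

print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 10.0))
```

With six elements, the first loop handles two elements and the second loop runs once, covering the remaining four.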

In the previous example, unrolling improves the performance of this loop by 
eliminating overhead instructions, although it increases code size substantially. 
How will the unrolled loop perform when it is scheduled for the pipeline 
described earlier? 


Example

Show the unrolled loop in the previous example after it has been scheduled for
the pipeline with the latencies shown in Figure 2.2.


Answer

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        S.D     F4,0(R1)
        S.D     F8,-8(R1)
        DADDUI  R1,R1,#-32
        S.D     F12,16(R1)
        S.D     F16,8(R1)
        BNE     R1,R2,Loop


The execution time of the unrolled loop has dropped to a total of 14 clock cycles,
or 3.5 clock cycles per element, compared with 9 cycles per element before any
unrolling or scheduling and 7 cycles when scheduled but not unrolled.

The gain from scheduling on the unrolled loop is even larger than on the original
loop. This increase arises because unrolling the loop exposes more computation
that can be scheduled to minimize the stalls; the code above has no stalls.
Scheduling the loop in this fashion necessitates realizing that the loads and stores 
are independent and can be interchanged. 


Summary of the Loop Unrolling and Scheduling 


Throughout this chapter and Appendix G, we will look at a variety of hardware 
and software techniques that allow us to take advantage of instruction-level 
parallelism to fully utilize the potential of the functional units in a processor. 
The key to most of these techniques is to know when and how the ordering 
among instructions may be changed. In our example we made many such 
changes, which to us, as human beings, were obviously allowable. In practice, 
this process must be performed in a methodical fashion either by a compiler or 
by hardware. To obtain the final unrolled code we had to make the following 
decisions and transformations: 


• Determine that unrolling the loop would be useful by finding that the loop
iterations were independent, except for the loop maintenance code.


• Use different registers to avoid unnecessary constraints that would be forced
by using the same registers for different computations.


• Eliminate the extra test and branch instructions and adjust the loop termination
and iteration code.


• Determine that the loads and stores in the unrolled loop can be interchanged
by observing that the loads and stores from different iterations are independent.
This transformation requires analyzing the memory addresses and finding
that they do not refer to the same address.


• Schedule the code, preserving any dependences needed to yield the same
result as the original code.


The key requirement underlying all of these transformations is an understanding 
of how one instruction depends on another and how the instructions can be 
changed or reordered given the dependences. 

There are three different types of limits to the gains that can be achieved by 
loop unrolling: a decrease in the amount of overhead amortized with each unroll, 
code size limitations, and compiler limitations. Let's consider the question of 
loop overhead first. When we unrolled the loop four times, it generated sufficient 
parallelism among the instructions that the loop could be scheduled with no stall 
cycles. In fact, in 14 clock cycles, only 2 cycles were loop overhead: the DADDUI,
which maintains the index value, and the BNE, which terminates the loop. If the 
loop is unrolled eight times, the overhead is reduced from 1/2 cycle per original 
iteration to 1/4. 
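
The diminishing return is easy to see numerically. The sketch below assumes, as in the scheduled loop above, that loop maintenance always costs 2 cycles (the DADDUI and the BNE) regardless of the unroll factor; the function name is illustrative.

```python
def overhead_per_iteration(unroll_factor, overhead_cycles=2):
    """Loop-maintenance cycles charged to each original iteration."""
    return overhead_cycles / unroll_factor

# Each doubling of the unroll factor saves half as much as the previous one.
for k in (1, 2, 4, 8, 16):
    print(k, overhead_per_iteration(k))
```

Going from four to eight copies saves 1/4 cycle per original iteration, while going from eight to sixteen saves only 1/8.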

A second limit to unrolling is the growth in code size that results. For larger
loops, the code size growth may be a concern, particularly if it causes an increase
in the instruction cache miss rate.

Another factor often more important than code size is the potential shortfall 
in registers that is created by aggressive unrolling and scheduling. This secondary 
effect that results from instruction scheduling in large code segments is called 
register pressure. It arises because scheduling code to increase ILP causes the 
number of live values to increase. After aggressive instruction scheduling, it may 
not be possible to allocate all the live values to registers. The transformed code, 
while theoretically faster, may lose some or all of its advantage because it gener- 
ates a shortage of registers. Without unrolling, aggressive scheduling is suffi- 
ciently limited by branches so that register pressure is rarely a problem. The 
combination of unrolling and aggressive scheduling can, however, cause this 
problem. The problem becomes especially challenging in multiple-issue proces- 
sors that require the exposure of more independent instruction sequences whose 
execution can be overlapped. In general, the use of sophisticated high-level trans- 
formations, whose potential improvements are hard to measure before detailed 
code generation, has led to significant increases in the complexity of modern 
compilers. 

Loop unrolling is a simple but useful method for increasing the size of 
straight-line code fragments that can be scheduled effectively. This transforma- 
tion is useful in a variety of processors, from simple pipelines like those we have 
examined so far to the multiple-issue superscalars and VLIWs explored later in 
this chapter. 


Reducing Branch Costs with Prediction 


Because of the need to enforce control dependences through branch hazards and 
stalls, branches will hurt pipeline performance. Loop unrolling is one way to 
reduce the number of branch hazards; we can also reduce the performance losses 
of branches by predicting how they will behave. 

The behavior of branches can be predicted both statically at compile time and 
dynamically by the hardware at execution time. Static branch predictors are 
sometimes used in processors where the expectation is that branch behavior is 
highly predictable at compile time; static prediction can also be used to assist 
dynamic predictors. 


Static Branch Prediction 


In Appendix A, we discuss an architectural feature that supports static branch
prediction, namely, delayed branches. Being able to accurately predict a branch
at compile time is also helpful for scheduling data hazards. Loop unrolling is
another example of a technique for improving code scheduling that depends on
predicting branches.

To reorder code around branches so that it runs faster, we need to predict the
branch statically when we compile the program. There are several different
methods to statically predict branch behavior. The simplest scheme is to predict a
branch as taken. This scheme has an average misprediction rate that is equal to
the untaken branch frequency, which for the SPEC programs is 34%. Unfortunately,
the misprediction rate for the SPEC programs ranges from not very accurate
(59%) to highly accurate (9%).

A more accurate technique is to predict branches on the basis of profile infor- 
mation collected from earlier runs. The key observation that makes this worth- 
while is that the behavior of branches is often bimodally distributed; that is, an 
individual branch is often highly biased toward taken or untaken. Figure 2.3 
shows the success of branch prediction using this strategy. The same input data 
were used for runs and for collecting the profile; other studies have shown that 
changing the input so that the profile is for a different run leads to only a small 
change in the accuracy of profile-based prediction. 

The effectiveness of any branch prediction scheme depends both on the accuracy
of the scheme and the frequency of conditional branches, which vary in
SPEC from 3% to 24%. The fact that the misprediction rate for the integer programs
is higher and that such programs typically have a higher branch frequency
is a major limitation for static branch prediction. In the next section, we consider
dynamic branch predictors, which most recent processors have employed.

Figure 2.3 Misprediction rate on SPEC92 for a profile-based predictor varies widely
but is generally better for the FP programs, which have an average misprediction
rate of 9% with a standard deviation of 4%, than for the integer programs, which
have an average misprediction rate of 15% with a standard deviation of 5%. The
actual performance depends on both the prediction accuracy and the branch
frequency, which vary from 3% to 24%.


Dynamic Branch Prediction and Branch-Prediction Buffers


The simplest dynamic branch-prediction scheme is a branch-prediction buffer or 
branch history table. A branch-prediction buffer is a small memory indexed by 
the lower portion of the address of the branch instruction. The memory contains a 
bit that says whether the branch was recently taken or not. This scheme is the 
simplest sort of buffer; it has no tags and is useful only to reduce the branch delay 
when it is longer than the time to compute the possible target PCs. 

With such a buffer, we don't know, in fact, if the prediction is correct—it may 
have been put there by another branch that has the same low-order address bits. 
But this doesn't matter. The prediction is a hint that is assumed to be correct, and 
fetching begins in the predicted direction. If the hint turns out to be wrong, the 
prediction bit is inverted and stored back. 
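
A minimal Python sketch of such an untagged 1-bit buffer follows. The class name, the power-of-two table size, and the use of word addresses are assumptions made for illustration; the point is that two branches sharing low-order address bits share one prediction bit.

```python
class OneBitBuffer:
    """Untagged 1-bit branch-prediction buffer indexed by low-order address bits."""

    def __init__(self, entries):
        # entries is assumed to be a power of two so a mask selects the index
        self.mask = entries - 1
        self.bits = [False] * entries   # False means "predict not taken"

    def predict(self, word_address):
        return self.bits[word_address & self.mask]

    def update(self, word_address, taken):
        # Storing the actual outcome is the same as inverting the bit
        # whenever the prediction turned out to be wrong.
        self.bits[word_address & self.mask] = taken

buf = OneBitBuffer(16)
buf.update(0x13, True)     # branch at word address 0x13 was taken
print(buf.predict(0x23))   # 0x23 aliases to the same entry as 0x13
```

Because there are no tags, the lookup for 0x23 returns the bit left behind by 0x13; the hardware simply treats it as a hint.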

This buffer is effectively a cache where every access is a hit, and, as we will 
see, the performance of the buffer depends on both how often the prediction is for 
the branch of interest and how accurate the prediction is when it matches. Before 
we analyze the performance, it is useful to make a small, but important, improve- 
ment in the accuracy of the branch-prediction scheme. 

This simple 1-bit prediction scheme has a performance shortcoming: Even if 
a branch is almost always taken, we will likely predict incorrectly twice, rather 
than once, when it is not taken, since the misprediction causes the prediction bit 
to be flipped. 

To remedy this weakness, 2-bit prediction schemes are often used. In a 2-bit 
scheme, a prediction must miss twice before it is changed. Figure 2.4 shows the 
finite-state processor for a 2-bit prediction scheme. 
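
The difference between the two schemes can be seen on a simple loop branch. The sketch below models each entry as an n-bit saturating counter that predicts taken in the upper half of its range; this is an assumption about the encoding, and the exact state transitions of Figure 2.4 may differ. The branch pattern (taken nine times, then not taken, repeated ten times) is also made up for the example.

```python
class SaturatingCounter:
    """n-bit saturating counter; predicts taken in the upper half of its range."""

    def __init__(self, bits, initial=0):
        self.max = (1 << bits) - 1
        self.value = initial

    def predict(self):
        return self.value > self.max // 2

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)

def mispredictions(bits, outcomes):
    """Count mispredictions for one branch, starting from the strongest taken state."""
    counter = SaturatingCounter(bits, initial=(1 << bits) - 1)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses

# Hypothetical loop branch: taken 9 times, then falls through; run 10 times.
outcomes = ([True] * 9 + [False]) * 10
print(mispredictions(1, outcomes), mispredictions(2, outcomes))
```

With 1 bit, every loop exit costs a second misprediction on the next loop entry; with 2 bits, only the exit itself is mispredicted.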

A branch-prediction buffer can be implemented as a small, special "cache" 
accessed with the instruction address during the IF pipe stage, or as a pair of bits 
attached to each block in the instruction cache and fetched with the instruction. If 
the instruction is decoded as a branch and if the branch is predicted as taken, 
fetching begins from the target as soon as the PC is known. Otherwise, sequential 
fetching and executing continue. As Figure 2.4 shows, if the prediction turns out 
to be wrong, the prediction bits are changed. 

What kind of accuracy can be expected from a branch-prediction buffer using 
2 bits per entry on real applications? Figure 2.5 shows that for the SPEC89 
benchmarks a branch-prediction buffer with 4096 entries results in a prediction 
accuracy ranging from over 99% to 82%, or a misprediction rate of 1% to 18%. A 
4K entry buffer, like that used for these results, is considered small by 2005 stan- 
dards, and a larger buffer could produce somewhat better results. 

Figure 2.4 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a
branch that strongly favors taken or not taken (as many branches do) will be
mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four
states in the system. The 2-bit scheme is actually a specialization of a more general
scheme that has an n-bit saturating counter for each entry in the prediction buffer. With
an n-bit counter, the counter can take on values between 0 and 2^n - 1: When the
counter is greater than or equal to one-half of its maximum value (2^n - 1), the branch is
predicted as taken; otherwise, it is predicted untaken. Studies of n-bit predictors have
shown that the 2-bit predictors do almost as well, and thus most systems rely on 2-bit
branch predictors rather than the more general n-bit predictors.

As we try to exploit more ILP, the accuracy of our branch prediction becomes
critical. As we can see in Figure 2.5, the accuracy of the predictors for integer
programs, which typically also have higher branch frequencies, is lower than for
the loop-intensive scientific programs. We can attack this problem in two ways:
by increasing the size of the buffer and by increasing the accuracy of the scheme
we use for each prediction. A buffer with 4K entries, however, as Figure 2.6
shows, performs quite comparably to an infinite buffer, at least for benchmarks
like those in SPEC. The data in Figure 2.6 make it clear that the hit rate of the
buffer is not the major limiting factor. As we mentioned above, simply increasing
the number of bits per predictor without changing the predictor structure also has
little impact. Instead, we need to look at how we might increase the accuracy of
each predictor.


Correlating Branch Predictors 


The 2-bit predictor schemes use only the recent behavior of a single branch to 
predict the future behavior of that branch. It may be possible to improve the pre- 
diction accuracy if we also look at the recent behavior of other branches rather 
than just the branch we are trying to predict. Consider a small code fragment 
from the eqntott benchmark, a member of early SPEC benchmark suites that dis- 
played particularly bad branch prediction behavior: 


if (aa==2)
    aa=0;
if (bb==2)
    bb=0;
if (aa!=bb) {


Figure 2.5 Prediction accuracy of a 4096-entry 2-bit prediction buffer for the
SPEC89 benchmarks. The misprediction rate for the integer benchmarks (gcc,
espresso, eqntott, and li) is substantially higher (average of 11%) than that for the FP
programs (average of 4%). Omitting the FP kernels (nasa7, matrix300, and tomcatv) still
yields a higher accuracy for the FP benchmarks than for the integer benchmarks. These
data, as well as the rest of the data in this section, are taken from a branch-prediction
study done using the IBM Power architecture and optimized code for that system. See
Pan, So, and Rahmeh [1992]. Although this data is for an older version of a subset of the
SPEC benchmarks, the newer benchmarks are larger and would show slightly worse
behavior, especially for the integer benchmarks.


Here is the MIPS code that we would typically generate for this code fragment,
assuming that aa and bb are assigned to registers R1 and R2:


        DADDIU  R3,R1,#-2
        BNEZ    R3,L1       ;branch b1 (aa!=2)
        DADD    R1,R0,R0    ;aa=0
L1:     DADDIU  R3,R2,#-2
        BNEZ    R3,L2       ;branch b2 (bb!=2)
        DADD    R2,R0,R0    ;bb=0
L2:     DSUBU   R3,R1,R2    ;R3=aa-bb
        BEQZ    R3,L3       ;branch b3 (aa==bb)


Let's label these branches b1, b2, and b3. The key observation is that the behavior
of branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if
branches b1 and b2 are both not taken (i.e., if the conditions both evaluate to true
and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are
clearly equal. A predictor that uses only the behavior of a single branch to predict
the outcome of that branch can never capture this behavior.
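
The correlation among b1, b2, and b3 can be confirmed by brute force. The Python sketch below mirrors the MIPS code rather than the C source (each flag is True when the corresponding branch is taken); the function name and the small range of test values are assumptions for illustration.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) taken flags for the fragment, following the MIPS code."""
    b1 = aa != 2          # BNEZ R3,L1 is taken when aa != 2
    if not b1:
        aa = 0
    b2 = bb != 2          # BNEZ R3,L2 is taken when bb != 2
    if not b2:
        bb = 0
    b3 = aa == bb         # BEQZ R3,L3 is taken when aa == bb
    return b1, b2, b3

results = [branch_outcomes(aa, bb) for aa in range(4) for bb in range(4)]

# Whenever b1 and b2 are both not taken, b3 is taken.
print(all(b3 for b1, b2, b3 in results if not b1 and not b2))
```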

Figure 2.6 Prediction accuracy of a 4096-entry 2-bit prediction buffer versus an
infinite buffer for the SPEC89 benchmarks. Although this data is for an older version of a
subset of the SPEC benchmarks, the results would be comparable for newer versions
with perhaps as many as 8K entries needed to match an infinite 2-bit predictor.


Branch predictors that use the behavior of other branches to make a predic- 
tion are called correlating predictors or two-level predictors. Existing correlating 
predictors add information about the behavior of the most recent branches to 
decide how to predict a given branch. For example, a (1,2) predictor uses the 
behavior of the last branch to choose from among a pair of 2-bit branch predic- 
tors in predicting a particular branch. In the general case an (m,n) predictor uses 
the behavior of the last m branches to choose from 2^m branch predictors, each of
which is an n-bit predictor for a single branch. The attraction of this type of cor- 
relating branch predictor is that it can yield higher prediction rates than the 2-bit 
scheme and requires only a trivial amount of additional hardware. 

The simplicity of the hardware comes from a simple observation: The global 
history of the most recent m branches can be recorded in an m-bit shift register, 
where each bit records whether the branch was taken or not taken. The branch- 
prediction buffer can then be indexed using a concatenation of the low-order bits 
from the branch address with the m-bit global history. For example, in a (2,2)
buffer with 64 total entries, the 4 low-order address bits of the branch (word
address) and the 2 global bits representing the behavior of the two most recently
executed branches form a 6-bit index that can be used to index the 64 counters.
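
The indexing scheme can be sketched in a few lines of Python. The class and method names are hypothetical, and placing the address bits above the history bits in the index is an arbitrary choice made here; the text only requires that the two fields be concatenated.

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: m global history bits, n-bit counters."""

    def __init__(self, m, n, address_bits):
        self.m = m
        self.address_bits = address_bits
        self.max = (1 << n) - 1
        self.history = 0                                  # m-bit global shift register
        self.counters = [0] * (1 << (address_bits + m))   # all start at 0

    def index(self, word_address):
        low = word_address & ((1 << self.address_bits) - 1)
        return (low << self.m) | self.history             # address bits, then history

    def predict(self, word_address):
        return self.counters[self.index(word_address)] > self.max // 2

    def update(self, word_address, taken):
        i = self.index(word_address)
        c = self.counters[i]
        self.counters[i] = min(c + 1, self.max) if taken else max(c - 1, 0)
        # Shift the newest outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

p = CorrelatingPredictor(m=2, n=2, address_bits=4)
print(len(p.counters))   # 64 counters, selected by a 6-bit index
```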

How much better do the correlating branch predictors work when compared 
with the standard 2-bit scheme? To compare them fairly, we must compare 
predictors that use the same number of state bits. The number of bits in an (m,n) 
predictor is 


2^m × n × Number of prediction entries selected by the branch address


A 2-bit predictor with no global history is simply a (0,2) predictor. 


Example

How many bits are in the (0,2) branch predictor with 4K entries? How many
entries are in a (2,2) predictor with the same number of bits?


Answer

The predictor with 4K entries has


2^0 × 2 × 4K = 8K bits


How many branch-selected entries are in a (2,2) predictor that has a total of 8K
bits in the prediction buffer? We know that


2^2 × 2 × Number of prediction entries selected by the branch = 8K


Hence, the number of prediction entries selected by the branch = 1K.
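
The same arithmetic, written as a small Python check. The function names are illustrative only.

```python
K = 1024

def predictor_bits(m, n, entries):
    """Total state bits in an (m, n) predictor with the given branch-selected entries."""
    return (2 ** m) * n * entries

def entries_for_budget(m, n, total_bits):
    """Branch-selected entries that fit in total_bits for an (m, n) predictor."""
    return total_bits // ((2 ** m) * n)

print(predictor_bits(0, 2, 4 * K) // K)       # size of the (0,2) predictor in K bits
print(entries_for_budget(2, 2, 8 * K) // K)   # entries of a (2,2) predictor in K
```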


Figure 2.7 compares the misprediction rates of the earlier (0,2) predictor with 
4K entries and a (2,2) predictor with 1K entries. As you can see, this correlating
predictor not only outperforms a simple 2-bit predictor with the same total num- 
ber of state bits, it often outperforms a 2-bit predictor with an unlimited number 
of entries. 


Tournament Predictors: Adaptively Combining Local and 
Global Predictors 


The primary motivation for correlating branch predictors came from the observation
that the standard 2-bit predictor using only local information failed on some
important branches and that, by adding global information, the performance
could be improved. Tournament predictors take this insight to the next level, by
using multiple predictors, usually one based on global information and one based
on local information, and combining them with a selector. Tournament predictors
can achieve better accuracy at medium sizes (8K-32K bits) and also make
effective use of very large numbers of prediction bits. Existing tournament
predictors use a 2-bit saturating counter per branch to choose among two different
predictors based on which predictor (local, global, or even some mix) was most
effective in recent predictions. As in a simple 2-bit predictor, the saturating
counter requires two mispredictions before changing the identity of the preferred
predictor.

Figure 2.7 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is
first, followed by a noncorrelating 2-bit predictor with unlimited entries and a 2-bit
predictor with 2 bits of global history and a total of 1024 entries. Although this data is for
an older version of SPEC, data for more recent SPEC benchmarks would show similar
differences in accuracy.
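
The 2-bit selector of a tournament predictor might be sketched as follows. This is a simplified model, not the logic of any particular processor: the encoding (values 0 and 1 prefer the local predictor, 2 and 3 the global one) and the rule of updating only when the two predictors disagree are assumptions.

```python
class Selector:
    """2-bit saturating chooser between a local and a global prediction."""

    def __init__(self):
        self.value = 0   # 0-1: prefer local, 2-3: prefer global (assumed encoding)

    def choose(self, local_pred, global_pred):
        return global_pred if self.value >= 2 else local_pred

    def update(self, local_pred, global_pred, taken):
        local_ok = local_pred == taken
        global_ok = global_pred == taken
        if global_ok and not local_ok:
            self.value = min(self.value + 1, 3)
        elif local_ok and not global_ok:
            self.value = max(self.value - 1, 0)
        # If both were right or both were wrong, the preference is unchanged.

s = Selector()
s.update(local_pred=True, global_pred=False, taken=False)   # local wrong once
print(s.choose(True, False))                                 # still follows local
s.update(local_pred=True, global_pred=False, taken=False)   # local wrong twice
print(s.choose(True, False))                                 # now follows global
```

Starting from the strongest local preference, two mispredictions by the local predictor (with the global one correct) are needed before the selector switches.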

The advantage of a tournament predictor is its ability to select the right pre- 
dictor for a particular branch, which is particularly crucial for the integer bench- 
marks. A typical tournament predictor will select the global predictor almost 40% 
of the time for the SPEC integer benchmarks and less than 15% of the time for 
the SPEC FP benchmarks. 

Figure 2.8 looks at the performance of three different predictors (a local 2-bit
predictor, a correlating predictor, and a tournament predictor) for different numbers
of bits using SPEC89 as the benchmark. As we saw earlier, the prediction
capability of the local predictor does not improve beyond a certain size. The
correlating predictor shows a significant improvement, and the tournament predictor
generates slightly better performance. For more recent versions of SPEC, the
results would be similar, but the asymptotic behavior would not be reached until
slightly larger-sized predictors.

Figure 2.8 The misprediction rate for three different predictors on SPEC89 as the total
number of bits is increased. The predictors are a local 2-bit predictor, a correlating
predictor, which is optimally structured in its use of global and local information at each
point in the graph, and a tournament predictor. Although this data is for an older version
of SPEC, data for more recent SPEC benchmarks would show similar behavior, perhaps
converging to the asymptotic limit at slightly larger predictor sizes.

In 2005, tournament predictors using about 30K bits are the standard in 
processors like the Power5 and Pentium 4. The most advanced of these predic- 
tors has been on the Alpha 21264, although both the Pentium 4 and Power5 
predictors are similar. The 21264's tournament predictor uses 4K 2-bit counters 
indexed by the local branch address to choose from among a global predictor 
and a local predictor. The global predictor also has 4K entries and is indexed by 
the history of the last 12 branches; each entry in the global predictor is a stan- 
dard 2-bit predictor. 

The local predictor consists of a two-level predictor. The top level is a local 
history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to 
the most recent 10 branch outcomes for the entry. That is, if the branch was taken 
10 or more times in a row, the entry in the local history table will be all 1s. If the
branch is alternately taken and untaken, the history entry consists of alternating
0s and 1s. This 10-bit history allows patterns of up to 10 branches to be discovered
and predicted. The selected entry from the local history table is used to
index a table of 1K entries consisting of 3-bit saturating counters, which provide
the local prediction. This combination, which uses a total of 29K bits, leads to 
high accuracy in branch prediction. 
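
The 29K-bit total follows directly from the table sizes just described. The Python tally below uses only those figures; the variable names are illustrative.

```python
K = 1024

selector_bits = 4 * K * 2         # 4K 2-bit counters that choose between predictors
global_bits = 4 * K * 2           # 4K-entry global predictor, 2 bits per entry
local_history_bits = 1024 * 10    # local history table: 1024 entries of 10 bits
local_counter_bits = 1 * K * 3    # 1K 3-bit saturating counters

total_bits = (selector_bits + global_bits
              + local_history_bits + local_counter_bits)
print(total_bits // K)            # total size in K bits
```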


To examine the effect on performance, we need to know the prediction accuracy
as well as the branch frequency, since the importance of accurate prediction
is larger in programs with higher branch frequency. For example, the integer
programs in the SPEC suite have higher branch frequencies than those of the more
easily predicted FP programs. For the 21264's predictor, the SPECfp95 benchmarks
have less than 1 misprediction per 1000 completed instructions, and for
SPECint95, there are about 11.5 mispredictions per 1000 completed instructions.
This corresponds to misprediction rates of less than 0.5% for the floating-point
programs and about 14% for the integer programs.

Later versions of SPEC contain programs with larger data sets and larger
code, resulting in higher miss rates. Thus, the importance of branch prediction
has increased. In Section 2.11, we will look at the performance of the Pentium 4
branch predictor on programs in the SPEC2000 suite and see that, despite more
aggressive branch prediction, the branch-prediction miss rates for the integer
programs remain significant.


2.4 Overcoming Data Hazards with Dynamic Scheduling


A simple statically scheduled pipeline fetches an instruction and issues it, unless
there is a data dependence between an instruction already in the pipeline and
the fetched instruction that cannot be hidden with bypassing or forwarding.
(Forwarding logic reduces the effective pipeline latency so that certain dependences
do not result in hazards.) If there is a data dependence that cannot be
hidden, then the hazard detection hardware stalls the pipeline starting with the 
instruction that uses the result. No new instructions are fetched or issued until the 
dependence is cleared. 

In this section, we explore dynamic scheduling, in which the hardware rear- 
ranges the instruction execution to reduce the stalls while maintaining data flow 
and exception behavior. Dynamic scheduling offers several advantages: It 
enables handling some cases when dependences are unknown at compile time 
(for example, because they may involve a memory reference), and it simplifies 
the compiler. Perhaps most importantly, it allows the processor to tolerate unpre- 
dictable delays such as cache misses, by executing other code while waiting for 
the miss to resolve. Almost as importantly, dynamic scheduling allows code that 
was compiled with one pipeline in mind to run efficiently on a different pipeline. 
In Section 2.6, we explore hardware speculation, a technique with significant per- 
formance advantages, which builds on dynamic scheduling. As we will see, the 
advantages of dynamic scheduling are gained at a cost of a significant increase in 
hardware complexity. 

Although a dynamically scheduled processor cannot change the data flow, it
tries to avoid stalling when dependences are present. In contrast, static pipeline
scheduling by the compiler (covered in Section 2.2) tries to minimize stalls by
separating dependent instructions so that they will not lead to hazards. Of course,
compiler pipeline scheduling can also be used on code destined to run on a
processor with a dynamically scheduled pipeline.


Dynamic Scheduling: The Idea


A major limitation of simple pipelining techniques is that they use in-order 
instruction issue and execution: Instructions are issued in program order, and if 
an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if 
there is a dependence between two closely spaced instructions in the pipeline, 
this will lead to a hazard and a stall will result. If there are multiple functional 
units, these units could lie idle. If instruction j depends on a long-running instruction
i, currently in execution in the pipeline, then all instructions after j must be
stalled until i is finished and j can execute. For example, consider this code:


        DIV.D   F0,F2,F4
        ADD.D   F10,F0,F8
        SUB.D   F12,F8,F14


The SUB.D instruction cannot execute because the dependence of ADD.D on
DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything in
the pipeline. This hazard creates a performance limitation that can be eliminated
by not requiring instructions to execute in program order.
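
A toy timing model makes the cost visible. The Python sketch below is a deliberate simplification: the latencies are hypothetical (they are not the values from Figure 2.2), one instruction issues per cycle, and only RAW dependences are modeled, which is sufficient for this three-instruction sequence.

```python
# Hypothetical latencies in cycles, chosen only for illustration.
LATENCY = {"DIV.D": 24, "ADD.D": 4, "SUB.D": 4}

def start_cycles(program, in_order):
    """Cycle in which each instruction begins execution (RAW dependences only)."""
    ready = {}    # register name -> cycle at which its new value is available
    starts = []
    for issue_cycle, (op, dest, *srcs) in enumerate(program):
        start = max([issue_cycle] + [ready.get(r, 0) for r in srcs])
        if in_order and starts:
            # In-order execution: cannot start before the previous instruction.
            start = max(start, starts[-1] + 1)
        starts.append(start)
        ready[dest] = start + LATENCY[op]
    return starts

program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),
    ("SUB.D", "F12", "F8", "F14"),
]
print(start_cycles(program, in_order=True))
print(start_cycles(program, in_order=False))
```

Under these assumptions the SUB.D waits behind the stalled ADD.D when execution is in order, but starts almost immediately when it is allowed to execute out of order.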

In the classic five-stage pipeline, both structural and data hazards could be 
checked during instruction decode (ID): When an instruction could execute with- 
out hazards, it was issued from ID knowing that all data hazards had been 
resolved. 

To allow us to begin executing the SUB.D in the above example, we must separate
the issue process into two parts: checking for any structural hazards and
waiting for the absence of a data hazard. Thus, we still use in-order instruction 
issue (i.e., instructions issued in program order), but we want an instruction to 
begin execution as soon as its data operands are available. Such a pipeline does 
out-of-order execution, which implies out-of-order completion. 

Out-of-order execution introduces the possibility of WAR and WAW hazards, 
which do not exist in the five-stage integer pipeline and its logical extension to an 
in-order floating-point pipeline. Consider the following MIPS floating-point code 
sequence: 


        DIV.D   F0,F2,F4
        ADD.D   F6,F0,F8
        SUB.D   F8,F10,F14
        MUL.D   F6,F10,F8


There is an antidependence between the ADD.D and the SUB.D, and if the pipeline
executes the SUB.D before the ADD.D (which is waiting for the DIV.D), it will violate
the antidependence, yielding a WAR hazard. Likewise, to avoid violating
output dependences, such as the write of F6 by MUL.D, WAW hazards must be
handled. As we will see, both these hazards are avoided by the use of register
renaming.
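
The dependences in this sequence can be listed mechanically. The Python sketch below is a naive pairwise check on register names; the tuple layout and function name are assumptions, and it does not account for intervening redefinitions of a register, which do not occur in this example.

```python
def classify_hazards(program):
    """Return a set of (kind, earlier_index, later_index) for register-name conflicts."""
    hazards = set()
    for i, (_, dest_i, *srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, *srcs_j = program[j]
            if dest_i in srcs_j:
                hazards.add(("RAW", i, j))   # later reads what earlier writes
            if dest_j in srcs_i:
                hazards.add(("WAR", i, j))   # later writes what earlier reads
            if dest_j == dest_i:
                hazards.add(("WAW", i, j))   # both write the same register
    return hazards

program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]
for hazard in sorted(classify_hazards(program)):
    print(hazard)
```

The check reports the WAR conflict on F8 between ADD.D and SUB.D and the WAW conflict on F6 between ADD.D and MUL.D, along with the two true (RAW) dependences.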

Out-of-order completion also creates major complications in handling exceptions.
Dynamic scheduling with out-of-order completion must preserve exception
behavior in the sense that exactly those exceptions that would arise if the program
were executed in strict program order actually do arise. Dynamically scheduled
processors preserve exception behavior by ensuring that no instruction can generate
an exception until the processor knows that the instruction raising the exception
will be executed; we will see shortly how this property can be guaranteed.

Although exception behavior must be preserved, dynamically scheduled pro- 
cessors may generate imprecise exceptions. An exception is imprecise if the pro- 
cessor state when an exception is raised does not look exactly as if the 
instructions were executed sequentially in strict program order. Imprecise excep- 
tions can occur because of two possibilities: 


1. The pipeline may have already completed instructions that are later in pro- 
gram order than the instruction causing the exception. 


2. The pipeline may have not yet completed some instructions that are earlier in 
program order than the instruction causing the exception. 


Imprecise exceptions make it difficult to restart execution after an exception. 
Rather than address these problems in this section, we will discuss a solution that 
provides precise exceptions in the context of a processor with speculation in Sec- 
tion 2.6. For floating-point exceptions, other solutions have been used, as dis- 
cussed in Appendix J. 

To allow out-of-order execution, we essentially split the ID pipe stage of our 
simple five-stage pipeline into two stages: 


1. Issue—Decode instructions, check for structural hazards. 


2. Read operands—Wait until no data hazards, then read operands. 


An instruction fetch stage precedes the issue stage and may fetch either into an 
instruction register or into a queue of pending instructions; instructions are then 
issued from the register or queue. The EX stage follows the read operands stage, 
just as in the five-stage pipeline. Execution may take multiple cycles, depending 
on the operation. 

We distinguish when an instruction begins execution and when it completes 
execution; between the two times, the instruction is in execution. Our pipeline 
allows multiple instructions to be in execution at the same time, and without this 
capability, a major advantage of dynamic scheduling is lost. Having multiple 
instructions in execution at once requires multiple functional units, pipelined 
functional units, or both. Since these two capabilities—pipelined functional units 
and multiple functional units—are essentially equivalent for the purposes of 
pipeline control, we will assume the processor has multiple functional units. 

In a dynamically scheduled pipeline, all instructions pass through the issue 
stage in order (in-order issue); however, they can be stalled or bypass each other 
in the second stage (read operands) and thus enter execution out of order. Score- 
boarding is a technique for allowing instructions to execute out of order when 
there are sufficient resources and no data dependences; it is named after the CDC 
6600 scoreboard, which developed this capability, and we discuss it in Appendix 
A. Here, we focus on a more sophisticated technique, called Tomasulo's algo- 
rithm, that has several major enhancements over scoreboarding. 


Dynamic Scheduling Using Tomasulo's Approach 


The IBM 360/91 floating-point unit used a sophisticated scheme to allow out-of- 
order execution. This scheme, invented by Robert Tomasulo, tracks when oper- 
ands for instructions are available, to minimize RAW hazards, and introduces 
register renaming, to minimize WAW and WAR hazards. There are many varia- 
tions on this scheme in modern processors, although the key concepts of tracking 
instruction dependences to allow execution as soon as operands are available and 
renaming registers to avoid WAR and WAW hazards are common characteristics. 

IBM's goal was to achieve high floating-point performance from an instruc- 
tion set and from compilers designed for the entire 360 computer family, rather 
than from specialized compilers for the high-end processors. The 360 architec- 
ture had only four double-precision floating-point registers, which limits the 
effectiveness of compiler scheduling; this fact was another motivation for the 
Tomasulo approach. In addition, the IBM 360/91 had long memory accesses and 
long floating-point delays, which Tomasulo's algorithm was designed to overcome. 
At the end of the section, we will see that Tomasulo's algorithm can also support the 
overlapped execution of multiple iterations of a loop. 

We explain the algorithm, which focuses on the floating-point unit and load- 
store unit, in the context of the MIPS instruction set. The primary difference 
between MIPS and the 360 is the presence of register-memory instructions in the 
latter architecture. Because Tomasulo's algorithm uses a load functional unit, no 
significant changes are needed to add register-memory addressing modes. The 
IBM 360/91 also had pipelined functional units, rather than multiple functional 
units, but we describe the algorithm as if there were multiple functional units. It 
is a simple conceptual extension to also pipeline those functional units. 

As we will see, RAW hazards are avoided by executing an instruction only 
when its operands are available. WAR and WAW hazards, which arise from name 
dependences, are eliminated by register renaming. Register renaming eliminates 
these hazards by renaming all destination registers, including those with a pend- 
ing read or write for an earlier instruction, so that the out-of-order write does not 
affect any instructions that depend on an earlier value of an operand. 

To better understand how register renaming eliminates WAR and WAW
hazards, consider the following example code sequence that includes both a
potential WAR and WAW hazard:

DIV.D   F0,F2,F4
ADD.D   F6,F0,F8
S.D     F6,0(R1)
SUB.D   F8,F10,F14
MUL.D   F6,F10,F8

There is an antidependence between the ADD.D and the SUB.D and an output
dependence between the ADD.D and the MUL.D, leading to two possible hazards: a
WAR hazard on the use of F8 by ADD.D and a WAW hazard since the ADD.D may
finish later than the MUL.D. There are also three true data dependences: between
the DIV.D and the ADD.D, between the SUB.D and the MUL.D, and between the
ADD.D and the S.D.

These two name dependences can both be eliminated by register renaming.
For simplicity, assume the existence of two temporary registers, S and T. Using S
and T, the sequence can be rewritten without any name dependences as

DIV.D   F0,F2,F4
ADD.D   S,F0,F8
S.D     S,0(R1)
SUB.D   T,F10,F14
MUL.D   F6,F10,T


In addition, any subsequent uses of F8 must be replaced by the register T. In this 
code segment, the renaming process can be done statically by the compiler. Find- 
ing any uses of F8 that are later in the code requires either sophisticated compiler 
analysis or hardware support, since there may be intervening branches between 
the above code segment and a later use of F8. As we will see, Tomasulo's algo- 
rithm can handle renaming across branches. 
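
The renaming idea can be sketched in a few lines of Python. This is only an illustration, not a model of the 360/91 hardware: the tuple encoding of instructions, the helper names `rename` and `name_hazards`, and the temporary names T0, T1, ... are assumptions made for the example. Unlike the hand-renamed sequence, which introduces only S and T, the sketch gives every destination a fresh name. The checker also reports the antidependence on F6 between the S.D and the MUL.D, which renaming removes in the same way.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name; read sources through the latest name."""
    fresh = (f"T{i}" for i in count())
    latest = {}  # architectural register -> most recent renamed register
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


def name_hazards(program):
    """List (kind, earlier_index, later_index) for WAR and WAW name dependences."""
    found = []
    for j, (_, dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:
                found.append(("WAR", i, j))
            if dest_j == dest_i:
                found.append(("WAW", i, j))
    return found


# (opcode, destination or None, source registers); S.D has no register destination.
program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

before = name_hazards(program)
renamed = rename(program)
after = name_hazards(renamed)
```

After renaming, `after` is empty, while the true dependences survive: the ADD.D still reads the DIV.D result, the S.D reads the ADD.D result, and the MUL.D reads the SUB.D result, each through its new name.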

In Tomasulo's scheme, register renaming is provided by reservation stations, 
which buffer the operands of instructions waiting to issue. The basic idea is that a 
reservation station fetches and buffers an operand as soon as it is available, elimi- 
nating the need to get the operand from a register. In addition, pending instruc- 
tions designate the reservation station that will provide their input. Finally, when 
successive writes to a register overlap in execution, only the last one is actually 
used to update the register. As instructions are issued, the register specifiers for 
pending operands are renamed to the names of the reservation station, which pro- 
vides register renaming. 

Since there can be more reservation stations than real registers, the technique 
can even eliminate hazards arising from name dependences that could not be 
eliminated by a compiler. As we explore the components of Tomasulo's scheme, 
we will return to the topic of register renaming and see exactly how the renaming 
occurs and how it eliminates WAR and WAW hazards. 

The use of reservation stations, rather than a centralized register file, leads to 
two other important properties. First, hazard detection and execution control are 
distributed: The information held in the reservation stations at each functional 
unit determines when an instruction can begin execution at that unit. Second, 
results are passed directly to functional units from the reservation stations where 
they are buffered, rather than going through the registers. This bypassing is done 
with a common result bus that allows all units waiting for an operand to be 
loaded simultaneously (on the 360/91 this is called the common data bus, or 
CDB). In pipelines with multiple execution units and issuing multiple instruc- 
tions per clock, more than one result bus will be needed. 




Figure 2.9 shows the basic structure of a Tomasulo-based processor, includ- 
ing both the floating-point unit and the load-store unit; none of the execution con- 
trol tables are shown. Each reservation station holds an instruction that has been 
issued and is awaiting execution at a functional unit, and either the operand val- 
ues for that instruction, if they have already been computed, or else the names of 
the reservation stations that will provide the operand values. 

The load buffers and store buffers hold data or addresses coming from and 
going to memory and behave almost exactly like reservation stations, so we dis- 
tinguish them only when necessary. The floating-point registers are connected by 
a pair of buses to the functional units and by a single bus to the store buffers. All
results from the functional units and from memory are sent on the common data
bus, which goes everywhere except to the load buffer. All reservation stations
have tag fields, employed by the pipeline control.

Figure 2.9 The basic structure of a MIPS floating-point unit using Tomasulo's
algorithm. Instructions are sent from the instruction unit into the instruction queue
from which they are issued in FIFO order. The reservation stations include the
operation and the actual operands, as well as information used for detecting and
resolving hazards. Load buffers have three functions: hold the components of the
effective address until it is computed, track outstanding loads that are waiting on
the memory, and hold the results of completed loads that are waiting for the CDB.
Similarly, store buffers have three functions: hold the components of the effective
address until it is computed, hold the destination memory addresses of outstanding
stores that are waiting for the data value to store, and hold the address and value
to store until the memory unit is available. All results from either the FP units or
the load unit are put on the CDB, which goes to the FP register file as well as to
the reservation stations and store buffers. The FP adders implement addition and
subtraction, and the FP multipliers do multiplication and division.


Before we describe the details of the reservation stations and the algorithm,
let's look at the steps an instruction goes through. There are only three steps,
although each one can now take an arbitrary number of clock cycles:


1. Issue—Get the next instruction from the head of the instruction queue, which
is maintained in FIFO order to ensure the maintenance of correct data flow. If
there is a matching reservation station that is empty, issue the instruction to
the station with the operand values, if they are currently in the registers. If
there is not an empty reservation station, then there is a structural hazard and
the instruction stalls until a station or buffer is freed. If the operands are not in
the registers, keep track of the functional units that will produce the operands.
This step renames registers, eliminating WAR and WAW hazards. (This stage
is sometimes called dispatch in a dynamically scheduled processor.)


2. Execute—If one or more of the operands is not yet available, monitor the 
common data bus while waiting for it to be computed. When an operand 
becomes available, it is placed into any reservation station awaiting it. When 
all the operands are available, the operation can be executed at the corre- 
sponding functional unit. By delaying instruction execution until the oper- 
ands are available, RAW hazards are avoided. (Some dynamically scheduled 
processors call this step "issue," but we use the name "execute," which was 
used in the first dynamically scheduled processor, the CDC 6600.) 

Notice that several instructions could become ready in the same clock 
cycle for the same functional unit. Although independent functional units 
could begin execution in the same clock cycle for different instructions, if 
more than one instruction is ready for a single functional unit, the unit will 
have to choose among them. For the floating-point reservation stations, this 
choice may be made arbitrarily; loads and stores, however, present an addi- 
tional complication. 

Loads and stores require a two-step execution process. The first step com- 
putes the effective address when the base register is available, and the effec- 
tive address is then placed in the load or store buffer. Loads in the load buffer 
execute as soon as the memory unit is available. Stores in the store buffer wait 
for the value to be stored before being sent to the memory unit. Loads and 
stores are maintained in program order through the effective address calcula- 
tion, which will help to prevent hazards through memory, as we will see 
shortly. 

To preserve exception behavior, no instruction is allowed to initiate exe- 
cution until all branches that precede the instruction in program order have 
completed. This restriction guarantees that an instruction that causes an 
exception during execution really would have been executed. In a processor 
using branch prediction (as all dynamically scheduled processors do), this 
means that the processor must know that the branch prediction was correct
before allowing an instruction after the branch to begin execution. If the
processor records the occurrence of the exception, but does not actually raise it,
an instruction can start execution and need not stall until it enters Write Result.

As we will see, speculation provides a more flexible and more complete 
method to handle exceptions, so we will delay making this enhancement and 
show how speculation handles this problem later. 


3. Write result—When the result is available, write it on the CDB and from 
there into the registers and into any reservation stations (including store buff- 
ers) waiting for this result. Stores are buffered in the store buffer until both the 
value to be stored and the store address are available, then the result is written 
as soon as the memory unit is free. 


The data structures that detect and eliminate hazards are attached to the reser- 
vation stations, to the register file, and to the load and store buffers with slightly 
different information attached to different objects. These tags are essentially 
names for an extended set of virtual registers used for renaming. In our example, 
the tag field is a 4-bit quantity that denotes one of the five reservation stations or 
one of the five load buffers. As we will see, this produces the equivalent of 10 
registers that can be designated as result registers (as opposed to the 4 double- 
precision registers that the 360 architecture contains). In a processor with more 
real registers, we would want renaming to provide an even larger set of virtual 
registers. The tag field describes which reservation station contains the instruc- 
tion that will produce a result needed as a source operand. 

Once an instruction has issued and is waiting for a source operand, it refers to 
the operand by the reservation station number where the instruction that will 
write the register has been assigned. Unused values, such as zero, indicate that 
the operand is already available in the registers. Because there are more reserva- 
tion stations than actual register numbers, WAW and WAR hazards are eliminated 
by renaming results using reservation station numbers. Although in Tomasulo's 
scheme the reservation stations are used as the extended virtual registers, other 
approaches could use a register set with additional registers or a structure like the 
reorder buffer, which we will see in Section 2.6. 

In Tomasulo's scheme, as well as the subsequent methods we look at for
supporting speculation, results are broadcast on a bus (the CDB), which is
monitored by the reservation stations. The combination of the common result bus and 
the retrieval of results from the bus by the reservation stations implements the 
forwarding and bypassing mechanisms used in a statically scheduled pipeline. In 
doing so, however, a dynamically scheduled scheme introduces one cycle of 
latency between source and result, since the matching of a result and its use can- 
not be done until the Write Result stage. Thus, in a dynamically scheduled pipe- 
line, the effective latency between a producing instruction and a consuming 
instruction is at least one cycle longer than the latency of the functional unit pro- 
ducing the result. 


In describing the operation of this scheme, we use a terminology taken from 
the CDC scoreboard scheme (see Appendix A) rather than introduce new termi- 
nology, showing the terminology used by the IBM 360/91 for historical refer- 
ence. It is important to remember that the tags in the Tomasulo scheme refer to 
the buffer or unit that will produce a result; the register names are discarded when 
an instruction issues to a reservation station. 

Each reservation station has seven fields: 


• Op—The operation to perform on source operands S1 and S2.


• Qj, Qk—The reservation stations that will produce the corresponding source
operand; a value of zero indicates that the source operand is already available
in Vj or Vk, or is unnecessary. (The IBM 360/91 calls these SINKunit and
SOURCEunit.)


• Vj, Vk—The value of the source operands. Note that only one of the V field
or the Q field is valid for each operand. For loads, the Vk field is used to
hold the offset field. (These fields are called SINK and SOURCE on the
IBM 360/91.)


• A—Used to hold information for the memory address calculation for a load
or store. Initially, the immediate field of the instruction is stored here; after
the address calculation, the effective address is stored here.


• Busy—Indicates that this reservation station and its accompanying functional
unit are occupied.


The register file has a field, Qi:


• Qi—The number of the reservation station that contains the operation whose
result should be stored into this register. If the value of Qi is blank (or 0), no
currently active instruction is computing a result destined for this register,
meaning that the value is simply the register contents.


The load and store buffers each have a field, A, which holds the result of the 
effective address once the first step of execution has been completed. 
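
As a rough sketch of these fields, the following Python record mirrors one reservation station and a register status table. The class name, the method names `ready` and `capture`, the use of strings such as "Load2" as tags, and the use of `None` in place of the zero tag are all assumptions made for illustration; the numeric values are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of the first source operand, once known
    vk: Optional[float] = None  # value of the second source operand, once known
    qj: Optional[str] = None    # tag of the producer of Vj; None stands for zero
    qk: Optional[str] = None    # tag of the producer of Vk; None stands for zero
    a: Optional[int] = None     # immediate, then effective address (loads and stores)

    def ready(self) -> bool:
        """An issued instruction may execute once neither Q field names a producer."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, result: float) -> None:
        """Take a broadcast result if this station is waiting on that tag."""
        if self.qj == tag:
            self.vj, self.qj = result, None
        if self.qk == tag:
            self.vk, self.qk = result, None


# A multiply whose first operand is still being loaded and whose second is in a register.
mult1 = ReservationStation("Mult1", busy=True, op="MUL", vk=4.0, qj="Load2")

# Register status (Qi): F0 will be written by Mult1; registers not listed have Qi = 0.
register_status = {"F0": "Mult1"}

ready_before = mult1.ready()
mult1.capture("Load2", 2.5)   # the load's result appears on the CDB
ready_after = mult1.ready()
```

For each operand exactly one of the V field or the Q field is meaningful, which is why `capture` fills the V field and clears the Q field in a single step.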

In the next section, we will first consider some examples that show how these 
mechanisms work and then examine the detailed algorithm. 


Dynamic Scheduling: Examples and the Algorithm 


Before we examine Tomasulo's algorithm in detail, let's consider a few exam- 
ples, which will help illustrate how the algorithm works. 


Example


Show what the information tables look like for the following code sequence
when only the first load has completed and written its result:

1. L.D    F6,32(R2)
2. L.D    F2,44(R3)
3. MUL.D  F0,F2,F4
4. SUB.D  F8,F2,F6
5. DIV.D  F10,F0,F6
6. ADD.D  F6,F8,F2


Answer


Figure 2.10 shows the result in three tables. The numbers appended to the names
add, mult, and load stand for the tag for that reservation station—Add1 is the tag 
for the result from the first add unit. In addition we have included an instruction 
status table. This table is included only to help you understand the algorithm; it is 
not actually a part of the hardware. Instead, the reservation station keeps the state 
of each operation that has issued. 


Tomasulo's scheme offers two major advantages over earlier and simpler 
schemes: (1) the distribution of the hazard detection logic and (2) the elimination 
of stalls for WAW and WAR hazards. 

The first advantage arises from the distributed reservation stations and the use 
of the Common Data Bus (CDB). If multiple instructions are waiting on a single 
result, and each instruction already has its other operand, then the instructions 
can be released simultaneously by the broadcast of the result on the CDB. If a 
centralized register file were used, the units would have to read their results from 
the registers when register buses are available. 

The second advantage, the elimination of WAW and WAR hazards, is accom- 
plished by renaming registers using the reservation stations, and by the process of 
storing operands into the reservation station as soon as they are available. 

For example, the code sequence in Figure 2.10 issues both the DIV.D and the
ADD.D, even though there is a WAR hazard involving F6. The hazard is
eliminated in one of two ways. First, if the instruction providing the value for the
DIV.D has completed, then Vk will store the result, allowing DIV.D to execute
independent of the ADD.D (this is the case shown). On the other hand, if the L.D
had not completed, then Qk would point to the Load1 reservation station, and the
DIV.D instruction would be independent of the ADD.D. Thus, in either case, the
ADD.D can issue and begin executing. Any uses of the result of the DIV.D would
point to the reservation station, allowing the ADD.D to complete and store its
value into the registers without affecting the DIV.D.

We'll see an example of the elimination of a WAW hazard shortly. But let's 
first look at how our earlier example continues execution. In this example, and 
the ones that follow in this chapter, assume the following latencies: load is 1 
clock cycle, add is 2 clock cycles, multiply is 6 clock cycles, and divide is 12 
clock cycles. 
Instruction status

Instruction          Issue   Execute   Write Result
L.D   F6,32(R2)      √       √         √
L.D   F2,44(R3)      √       √
MUL.D F0,F2,F4       √
SUB.D F8,F2,F6       √
DIV.D F10,F0,F6      √
ADD.D F6,F8,F2       √

Reservation stations

Name    Busy   Op     Vj   Vk                     Qj      Qk      A
Load1   no
Load2   yes    Load                                               44 + Regs[R3]
Add1    yes    SUB         Mem[32 + Regs[R2]]     Load2
Add2    yes    ADD                                Add1    Load2
Add3    no
Mult1   yes    MUL         Regs[F4]               Load2
Mult2   yes    DIV         Mem[32 + Regs[R2]]     Mult1

Register status

Field   F0      F2      F4   F6     F8     F10     F12   ...   F30
Qi      Mult1   Load2        Add2   Add1   Mult2


Figure 2.10 Reservation stations and register tags shown when all of the instructions have issued, but only
the first load instruction has completed and written its result to the CDB. The second load has completed
effective address calculation, but is waiting on the memory unit. We use the array Regs[ ] to refer to the register
file and the array Mem[ ] to refer to the memory. Remember that an operand is specified by either a Q field or a
V field at any time. Notice that the ADD.D instruction, which has a WAR hazard at the WB stage, has issued and
could complete before the DIV.D initiates.


Example


Using the same code segment as in the previous example (page 97), show what
the status tables look like when the MUL.D is ready to write its result.


Answer


The result is shown in the three tables in Figure 2.11. Notice that ADD.D has
completed since the operands of DIV.D were copied, thereby overcoming the WAR
hazard. Notice that even if the load of F6 was delayed, the add into F6 could be
executed without triggering a WAW hazard.
Instruction status

Instruction          Issue   Execute   Write Result
L.D   F6,32(R2)      √       √         √
L.D   F2,44(R3)      √       √         √
MUL.D F0,F2,F4       √       √
SUB.D F8,F2,F6       √       √         √
DIV.D F10,F0,F6      √
ADD.D F6,F8,F2       √       √         √

Reservation stations

Name    Busy   Op     Vj                     Vk                     Qj      Qk   A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   yes    MUL    Mem[44 + Regs[R3]]     Regs[F4]
Mult2   yes    DIV                           Mem[32 + Regs[R2]]     Mult1

Register status

Field   F0      F2   F4   F6   F8   F10     F12   ...   F30
Qi      Mult1                       Mult2


Figure 2.11 Multiply and divide are the only instructions not finished.


Tomasulo's Algorithm: The Details 


Figure 2.12 specifies the checks and steps that each instruction must go through. 
As mentioned earlier, loads and stores go through a functional unit for effective 
address computation before proceeding to independent load or store buffers. 
Loads take a second execution step to access memory and then go to Write Result 
to send the value from memory to the register file and/or any waiting reservation 
stations. Stores complete their execution in the Write Result stage, which writes 
the result to memory. Notice that all writes occur in Write Result, whether the 
destination is a register or memory. This restriction simplifies Tomasulo's algo- 
rithm and is critical to its extension with speculation in Section 2.6. 
Instruction state     Wait until                  Action or bookkeeping

Issue
  FP operation        Station r empty             if (RegisterStat[rs].Qi≠0)
                                                    {RS[r].Qj ← RegisterStat[rs].Qi}
                                                  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
                                                  if (RegisterStat[rt].Qi≠0)
                                                    {RS[r].Qk ← RegisterStat[rt].Qi}
                                                  else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};
                                                  RS[r].Busy ← yes; RegisterStat[rd].Qi ← r;

  Load or store       Buffer r empty              if (RegisterStat[rs].Qi≠0)
                                                    {RS[r].Qj ← RegisterStat[rs].Qi}
                                                  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
                                                  RS[r].A ← imm; RS[r].Busy ← yes;

  Load only                                       RegisterStat[rt].Qi ← r;

  Store only                                      if (RegisterStat[rt].Qi≠0)
                                                    {RS[r].Qk ← RegisterStat[rt].Qi}
                                                  else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};

Execute
  FP operation        (RS[r].Qj = 0) and          Compute result: operands are in Vj and Vk
                      (RS[r].Qk = 0)

  Load-store step 1   RS[r].Qj = 0 & r is head    RS[r].A ← RS[r].Vj + RS[r].A;
                      of load-store queue

  Load step 2         Load step 1 complete        Read from Mem[RS[r].A]

Write Result
  FP operation        Execution complete at r &   ∀x(if (RegisterStat[x].Qi=r) {Regs[x] ← result;
  or load             CDB available                 RegisterStat[x].Qi ← 0});
                                                  ∀x(if (RS[x].Qj=r) {RS[x].Vj ← result; RS[x].Qj ← 0});
                                                  ∀x(if (RS[x].Qk=r) {RS[x].Vk ← result; RS[x].Qk ← 0});
                                                  RS[r].Busy ← no;

  Store               Execution complete at r &   Mem[RS[r].A] ← RS[r].Vk;
                      RS[r].Qk = 0                RS[r].Busy ← no;


Figure 2.12 Steps in the algorithm and what is required for each step. For the issuing instruction, rd is the
destination, rs and rt are the source register numbers, imm is the sign-extended immediate field, and r is the
reservation station or buffer that the instruction is assigned to. RS is the reservation station data structure. The
value returned by an FP unit or by the load unit is called result. RegisterStat is the register status data structure
(not the register file, which is Regs[ ]). When an instruction is issued, the destination register has its Qi field set
to the number of the buffer or reservation station to which the instruction is issued. If the operands are available
in the registers, they are stored in the V fields. Otherwise, the Q fields are set to indicate the reservation station
that will produce the values needed as source operands. The instruction waits at the reservation station until both
its operands are available, indicated by zero in the Q fields. The Q fields are set to zero either when this
instruction is issued, or when an instruction on which this instruction depends completes and does its write back.
When an instruction has finished execution and the CDB is available, it can do its write back. All the buffers,
registers, and reservation stations whose value of Qj or Qk is the same as the completing reservation station
update their values from the CDB and mark the Q fields to indicate that values have been received. Thus, the CDB
can broadcast its result to many destinations in a single clock cycle, and if the waiting instructions have their
operands, they can all begin execution on the next clock cycle. Loads go through two steps in Execute, and stores
perform slightly differently during Write Result, where they may have to wait for the value to store. Remember
that to preserve exception behavior, instructions should not be allowed to execute if a branch that is earlier in
program order has not yet completed. Because any concept of program order is not maintained after the Issue
stage, this restriction is usually implemented by preventing any instruction from leaving the Issue step, if there
is a pending branch already in the pipeline. In Section 2.6, we will see how speculation support
removes this restriction.
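
The Issue and Write Result bookkeeping for FP operations in Figure 2.12 can be sketched as a small Python class. This is a simplified illustration rather than a full simulator: the class name `TomasuloFP`, the dictionary layout, and the use of `None` for a zero tag are assumptions, loads and stores are left out, and the caller supplies each result value because functional units are not modeled. The register values are arbitrary.

```python
class TomasuloFP:
    """FP-only Issue and Write Result bookkeeping, following Figure 2.12."""

    def __init__(self, stations, regs):
        self.rs = {name: self._empty() for name in stations}
        self.regs = dict(regs)  # Regs[]: architectural register values
        self.qi = {}            # RegisterStat[].Qi; a missing key plays the role of 0

    @staticmethod
    def _empty():
        return {"busy": False, "op": None, "vj": None, "vk": None, "qj": None, "qk": None}

    def _read(self, reg):
        """Return (V, Q) for a source register: either a value or a producer tag."""
        tag = self.qi.get(reg)
        return (None, tag) if tag is not None else (self.regs[reg], None)

    def issue(self, r, op, rd, rs, rt):
        if self.rs[r]["busy"]:
            raise RuntimeError(f"structural hazard: station {r} is busy")
        vj, qj = self._read(rs)
        vk, qk = self._read(rt)
        self.rs[r] = {"busy": True, "op": op, "vj": vj, "vk": vk, "qj": qj, "qk": qk}
        self.qi[rd] = r  # later readers of rd now wait on station r

    def ready(self, r):
        st = self.rs[r]
        return st["busy"] and st["qj"] is None and st["qk"] is None

    def write_result(self, r, result):
        # Update only the registers that still name r as their producer.
        for reg in [x for x, tag in self.qi.items() if tag == r]:
            self.regs[reg] = result
            del self.qi[reg]
        # Broadcast to every station waiting on tag r.
        for st in self.rs.values():
            if st["qj"] == r:
                st["vj"], st["qj"] = result, None
            if st["qk"] == r:
                st["vk"], st["qk"] = result, None
        self.rs[r] = self._empty()


m = TomasuloFP(
    ["Add1", "Add2", "Add3", "Mult1", "Mult2"],
    {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0},
)

# WAR: the DIV buffers the old F6 at issue, so the later ADD may overwrite F6 first.
m.issue("Mult1", "MUL", "F0", "F2", "F4")
m.issue("Mult2", "DIV", "F10", "F0", "F6")
m.issue("Add1", "ADD", "F6", "F8", "F2")
m.write_result("Add1", 10.0)
f6_after_add = m.regs["F6"]
m.write_result("Mult1", 8.0)
div_operands = (m.rs["Mult2"]["vj"], m.rs["Mult2"]["vk"])
div_ready = m.ready("Mult2")

# WAW: two writers of F6 in flight; only the later one in program order updates F6.
m.issue("Add2", "SUB", "F6", "F8", "F2")
m.issue("Add3", "ADD", "F6", "F2", "F4")
m.write_result("Add3", 7.0)   # the later writer happens to finish first
m.write_result("Add2", 99.0)  # the earlier writer no longer owns F6
```

In the WAR case the divide ends up holding 8.0 and the old F6 value 6.0 even though F6 already contains 10.0. In the WAW case F6 ends at 7.0, because the result tagged Add2 finds no register still waiting for it.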
Tomasulo's Algorithm: A Loop-Based Example 


To understand the full power of eliminating WAW and WAR hazards through 
dynamic renaming of registers, we must look at a loop. Consider the following 
simple sequence for multiplying the elements of an array by a scalar in F2: 


Loop:  L.D     F0,0(R1)
       MUL.D   F4,F0,F2
       S.D     F4,0(R1)
       DADDIU  R1,R1,-8
       BNE     R1,R2,Loop ; branches if R1≠R2


If we predict that branches are taken, using reservation stations will allow multi- 
ple executions of this loop to proceed at once. This advantage is gained without 
changing the code—in effect, the loop is unrolled dynamically by the hardware, 
using the reservation stations obtained by renaming to act as additional registers. 

Let's assume we have issued all the instructions in two successive iterations 
of the loop, but none of the floating-point load-stores or operations has com- 
pleted. Figure 2.13 shows reservation stations, register status tables, and load and 
store buffers at this point. (The integer ALU operation is ignored, and it is 
assumed the branch was predicted as taken.) Once the system reaches this state, 
two copies of the loop could be sustained with a CPI close to 1.0, provided the 
multiplies could complete in 4 clock cycles. With a latency of 6 cycles, additional 
iterations will need to be processed before the steady state can be reached. This 
requires more reservation stations to hold instructions that are in execution. As 
we will see later in this chapter, when extended with multiple instruction issue, 
Tomasulo's approach can sustain more than one instruction per clock. 

A load and a store can safely be done out of order, provided they access dif- 
ferent addresses. If a load and a store access the same address, then either 


• the load is before the store in program order and interchanging them results in
a WAR hazard, or


• the store is before the load in program order and interchanging them results in
a RAW hazard.


Similarly, interchanging two stores to the same address results in a WAW hazard. 

Hence, to determine if a load can be executed at a given time, the processor 
can check whether any uncompleted store that precedes the load in program order 
shares the same data memory address as the load. Similarly, a store must wait 
until there are no unexecuted loads or stores that are earlier in program order and 
share the same data memory address. We consider a method to eliminate this 
restriction in Section 2.9. 

To detect such hazards, the processor must have computed the data memory 
address associated with any earlier memory operation. A simple, but not neces- 
sarily optimal, way to guarantee that the processor has all such addresses is to 
perform the effective address calculations in program order. (We really only need
to keep the relative order between stores and other memory references; that is,
loads can be reordered freely.)


Instruction status

Instruction          From iteration   Issue   Execute   Write Result
L.D   F0,0(R1)       1                √       √
MUL.D F4,F0,F2       1                √
S.D   F4,0(R1)       1                √
L.D   F0,0(R1)       2                √       √
MUL.D F4,F0,F2       2                √
S.D   F4,0(R1)       2                √

Reservation stations

Name     Busy   Op      Vj             Vk         Qj      Qk      A
Load1    yes    Load                                              Regs[R1] + 0
Load2    yes    Load                                              Regs[R1] - 8
Add1     no
Add2     no
Add3     no
Mult1    yes    MUL                    Regs[F2]   Load1
Mult2    yes    MUL                    Regs[F2]   Load2
Store1   yes    Store   Regs[R1]                          Mult1
Store2   yes    Store   Regs[R1] - 8                      Mult2

Register status

Field   F0      F2   F4      F6   F8   F10   F12   ...   F30
Qi      Load2        Mult2


Figure 2.13 Two active iterations of the loop with no instruction yet completed. Entries in the multiplier
reservation stations indicate that the outstanding loads are the sources. The store reservation stations indicate
that the multiply destination is the source of the value to store.

Let's consider the situation of a load first. If we perform effective address calcu- 
lation in program order, then when a load has completed effective address calcula- 
tion, we can check whether there is an address conflict by examining the A field of 
all active store buffers. If the load address matches the address of any active entries 
in the store buffer, that load instruction is not sent to the load buffer until the con- 
flicting store completes. (Some implementations bypass the value directly to the 
load from a pending store, reducing the delay for this RAW hazard.) 
Stores operate similarly, except that the processor must check for conflicts in 
both the load buffers and the store buffers, since conflicting stores cannot be reor- 
dered with respect to either a load or a store. 
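
The two address checks can be written down directly. In this sketch the sets hold the effective addresses of earlier memory operations that have not yet completed, which presumes those addresses were computed in program order as described above. The function names and the concrete addresses (R1 taken to be 1000 in the first loop iteration) are made up for illustration.

```python
def load_may_proceed(load_addr, earlier_store_addrs):
    """A load waits only for an earlier, uncompleted store to the same address (RAW)."""
    return load_addr not in earlier_store_addrs


def store_may_proceed(store_addr, earlier_load_addrs, earlier_store_addrs):
    """A store waits for earlier loads (WAR) and earlier stores (WAW) to its address."""
    return (store_addr not in earlier_load_addrs
            and store_addr not in earlier_store_addrs)


# Iteration 1 stores to address 1000; iteration 2 loads from and stores to 992.
pending_stores = {1000}  # S.D F4,0(R1) from iteration 1, not yet written to memory
pending_loads = {992}    # L.D F0,0(R1) from iteration 2, not yet executed

load_992_ok = load_may_proceed(992, pending_stores)
load_1000_ok = load_may_proceed(1000, pending_stores)
store_992_ok = store_may_proceed(992, pending_loads, pending_stores)
store_984_ok = store_may_proceed(984, pending_loads, pending_stores)
```

The load from 992 can run ahead of the pending store to 1000, whereas a load from 1000 would have to wait for it. The store to 992 must wait for the earlier load from the same address, while a store to an untouched address such as 984 may proceed.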

A dynamically scheduled pipeline can yield very high performance, provided 
branches are predicted accurately—an issue we addressed in the last section. The 
major drawback of this approach is the complexity of the Tomasulo scheme, 
which requires a large amount of hardware. In particular, each reservation station 
must contain an associative buffer, which must run at high speed, as well as com- 
plex control logic. The performance can also be limited by the single CDB. 
Although additional CDBs can be added, each CDB must interact with each res- 
ervation station, and the associative tag-matching hardware would need to be 
duplicated at each station for each CDB. 

In Tomasulo's scheme two different techniques are combined: the renaming 
of the architectural registers to a larger set of registers and the buffering of source 
operands from the register file. Source operand buffering resolves WAR hazards 
that arise when the operand is available in the registers. As we will see later, it is 
also possible to eliminate WAR hazards by the renaming of a register together 
with the buffering of a result until no outstanding references to the earlier version 
of the register remain. This approach will be used when we discuss hardware 
speculation. 

Tomasulo's scheme was unused for many years after the 360/91, but was 
widely adopted in multiple-issue processors starting in the 1990s for several rea- 
sons: 


1. It can achieve high performance without requiring the compiler to target code 
to a specific pipeline structure, a valuable property in the era of shrink- 
wrapped mass market software. 


2. Although Tomasulo's algorithm was designed before caches, the presence of 
caches, with the inherently unpredictable delays, has become one of the 
major motivations for dynamic scheduling. Out-of-order execution allows the 
processors to continue executing instructions while awaiting the completion 
of a cache miss, thus hiding all or part of the cache miss penalty. 


3. As processors became more aggressive in their issue capability and designers 
became concerned with the performance of difficult-to-schedule code (such as 
most nonnumeric code), techniques such as register renaming and dynamic 
scheduling became more important. 


4. Because dynamic scheduling is a key component of speculation, it was 
adopted along with hardware speculation in the mid-1990s. 


2.6 Hardware-Based Speculation 


As we try to exploit more instruction-level parallelism, maintaining control 
dependences becomes an increasing burden. Branch prediction reduces the direct 
stalls attributable to branches, but for a processor executing multiple instructions 
per clock, just predicting branches accurately may not be sufficient to generate 
the desired amount of instruction-level parallelism. A wide issue processor may 
need to execute a branch every clock cycle to maintain maximum performance. 
Hence, exploiting more parallelism requires that we overcome the limitation of 
control dependence. 

Overcoming control dependence is done by speculating on the outcome of 
branches and executing the program as if our guesses were correct. This mech- 
anism represents a subtle, but important, extension over branch prediction with 
dynamic scheduling. In particular, with speculation, we fetch, issue, and exe- 
cute instructions, as if our branch predictions were always correct; dynamic 
scheduling only fetches and issues such instructions. Of course, we need mech- 
anisms to handle the situation where the speculation is incorrect. Appendix G 
discusses a variety of mechanisms for supporting speculation by the compiler. 
In this section, we explore hardware speculation, which extends the ideas of 
dynamic scheduling. 

Hardware-based speculation combines three key ideas: dynamic branch pre- 
diction to choose which instructions to execute, speculation to allow the execu- 
tion of instructions before the control dependences are resolved (with the ability 
to undo the effects of an incorrectly speculated sequence), and dynamic schedul- 
ing to deal with the scheduling of different combinations of basic blocks. (In 
comparison, dynamic scheduling without speculation only partially overlaps 
basic blocks because it requires that a branch be resolved before actually execut- 
ing any instructions in the successor basic block.) 

Hardware-based speculation follows the predicted flow of data values to 
choose when to execute instructions. This method of executing programs is 
essentially a data flow execution: Operations execute as soon as their operands 
are available. 

To extend Tomasulo's algorithm to support speculation, we must separate the 
bypassing of results among instructions, which is needed to execute an instruc- 
tion speculatively, from the actual completion of an instruction. By making this 
separation, we can allow an instruction to execute and to bypass its results to 
other instructions, without allowing the instruction to perform any updates that 
cannot be undone, until we know that the instruction is no longer speculative. 

Using the bypassed value is like performing a speculative register read, since 
we do not know whether the instruction providing the source register value is 
providing the correct result until the instruction is no longer speculative. When an 
instruction is no longer speculative, we allow it to update the register file or mem- 
ory; we call this additional step in the instruction execution sequence instruction 
commit. 

The key idea behind implementing speculation is to allow instructions to exe- 
cute out of order but to force them to commit in order and to prevent any irrevo- 
cable action (such as updating state or taking an exception) until an instruction 
commits. Hence, when we add speculation, we need to separate the process of 
completing execution from instruction commit, since instructions may finish exe- 
cution considerably before they are ready to commit. Adding this commit phase 
to the instruction execution sequence requires an additional set of hardware buffers 
that hold the results of instructions that have finished execution but have not 
committed. This hardware buffer, which we call the reorder buffer, is also used to 
pass results among instructions that may be speculated. 

The reorder buffer (ROB) provides additional registers in the same way as the 
reservation stations in Tomasulo's algorithm extend the register set. The ROB 
holds the result of an instruction between the time the operation associated with 
the instruction completes and the time the instruction commits. Hence, the ROB 
is a source of operands for instructions, just as the reservation stations provide 
operands in Tomasulo's algorithm. The key difference is that in Tomasulo's algo- 
rithm, once an instruction writes its result, any subsequently issued instructions 
will find the result in the register file. With speculation, the register file is not 
updated until the instruction commits (and we know definitively that the instruc- 
tion should execute); thus, the ROB supplies operands in the interval between 
completion of instruction execution and instruction commit. The ROB is similar 
to the store buffer in Tomasulo's algorithm, and we integrate the function of the 
store buffer into the ROB for simplicity. 

Each entry in the ROB contains four fields: the instruction type, the destina- 
tion field, the value field, and the ready field. The instruction type field indicates 
whether the instruction is a branch (and has no destination result), a store (which 
has a memory address destination), or a register operation (ALU operation or 
load, which has register destinations). The destination field supplies the register 
number (for loads and ALU operations) or the memory address (for stores) where 
the instruction result should be written. The value field is used to hold the value 
of the instruction result until the instruction commits. We will see an example of 
ROB entries shortly. Finally, the ready field indicates that the instruction has 
completed execution, and the value is ready. 
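
The four fields just described can be pictured as a small record. The following is only a sketch for illustration: the names `InstrType` and `ROBEntry` and the use of a Python dataclass are assumptions made here, not part of the hardware description.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = auto()    # has no destination result
    STORE = auto()     # destination is a memory address
    REGISTER = auto()  # ALU operation or load; destination is a register


@dataclass
class ROBEntry:
    instr_type: InstrType
    dest: Optional[Union[str, int]] = None  # register name, or address for a store
    value: Optional[float] = None           # result held until the instruction commits
    ready: bool = False                     # True once execution has completed


# A register operation writing F0: allocated at issue, filled in at write result.
entry = ROBEntry(InstrType.REGISTER, dest="F0")
entry.value, entry.ready = 6.0, True
```

A branch entry would simply leave `dest` and `value` unset, which matches the statement that a branch has no destination result.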

Figure 2.14 shows the hardware structure of the processor including the ROB. 
The ROB subsumes the store buffers. Stores still execute in two steps, but the 
second step is performed by instruction commit. Although the renaming function 
of the reservation stations is replaced by the ROB, we still need a place to buffer 
operations (and operands) between the time they issue and the time they begin 
execution. This function is still provided by the reservation stations. Since every 
instruction has a position in the ROB until it commits, we tag a result using the 
ROB entry number rather than using the reservation station number. This tagging 
requires that the ROB assigned for an instruction must be tracked in the reserva- 
tion station. Later in this section, we will explore an alternative implementation 
that uses extra registers for renaming and the ROB only to track when instruc- 
tions can commit. 

Figure 2.14 The basic structure of a FP unit using Tomasulo's algorithm and 
extended to handle speculation. Comparing this to Figure 2.9 on page 94, which 
implemented Tomasulo's algorithm, the major change is the addition of the ROB and 
the elimination of the store buffer, whose function is integrated into the ROB. This 
mechanism can be extended to multiple issue by making the CDB wider to allow for 
multiple completions per clock. 

Here are the four steps involved in instruction execution: 


1. Issue—Get an instruction from the instruction queue. Issue the instruction if 
there is an empty reservation station and an empty slot in the ROB; send the 
operands to the reservation station if they are available in either the registers 
or the ROB. Update the control entries to indicate the buffers are in use. The 
number of the ROB entry allocated for the result is also sent to the reservation 
station, so that the number can be used to tag the result when it is placed on 
the CDB. If either all reservation stations are full or the ROB is full, then 
instruction issue is stalled until both have available entries. 


2. Execute—If one or more of the operands is not yet available, monitor the 
CDB while waiting for the register to be computed. This step checks for 
RAW hazards. When both operands are available at a reservation station, exe- 
cute the operation. Instructions may take multiple clock cycles in this stage, 
and loads still require two steps in this stage. Stores need only have the base 
register available at this step, since execution for a store at this point is only 
effective address calculation. 
3. Write result—When the result is available, write it on the CDB (with the 
ROB tag sent when the instruction issued) and from the CDB into the ROB, 
as well as to any reservation stations waiting for this result. Mark the 
reservation station as available. Special actions are required for store 
instructions. If the value to be stored is available, it is written into the Value field 
of the ROB entry for the store. If the value to be stored is not available yet, 
the CDB must be monitored until that value is broadcast, at which time the 
Value field of the ROB entry of the store is updated. For simplicity we 
assume that this occurs during the Write Result stage of a store; we discuss 
relaxing this requirement later. 


4. Commit—This is the final stage of completing an instruction, after which 
only its result remains. (Some processors call this commit phase "comple- 
tion" or "graduation.") There are three different sequences of actions at com- 
mit depending on whether the committing instruction is a branch with an 
incorrect prediction, a store, or any other instruction (normal commit). The 
normal commit case occurs when an instruction reaches the head of the ROB 
and its result is present in the buffer; at this point, the processor updates the 
register with the result and removes the instruction from the ROB. Commit- 
ting a store is similar except that memory is updated rather than a result regis- 
ter. When a branch with incorrect prediction reaches the head of the ROB, it 
indicates that the speculation was wrong. The ROB is flushed and execution 
is restarted at the correct successor of the branch. If the branch was correctly 
predicted, the branch is finished. 
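
The three commit cases described in step 4 can be sketched in a few lines. This is a simplified model under stated assumptions, not the hardware algorithm: the ROB is a Python `deque` whose left end is the head, each entry is a plain dictionary, and the function name `commit` and the keys `type`, `dest`, `value`, `ready`, and `mispredicted` are hypothetical names chosen for this illustration.

```python
from collections import deque


def commit(rob, regs, mem):
    """Try to commit the instruction at the head of the ROB.

    Returns "stall", "commit", or "flush".
    """
    if not rob or not rob[0]["ready"]:
        return "stall"                       # head has not finished executing
    head = rob.popleft()                     # entry is reclaimed
    if head["type"] == "branch":
        if head["mispredicted"]:
            rob.clear()                      # discard all wrong-path instructions
            return "flush"                   # fetch restarts at the correct successor
    elif head["type"] == "store":
        mem[head["dest"]] = head["value"]    # memory is updated only at commit
    else:
        regs[head["dest"]] = head["value"]   # normal commit: update the register
    return "commit"


regs, mem = {"F4": 0.0}, {}
rob = deque([
    {"type": "reg", "dest": "F4", "value": 2.5, "ready": True},
    {"type": "store", "dest": 1000, "value": 2.5, "ready": True},
    {"type": "branch", "mispredicted": True, "ready": True},
    {"type": "reg", "dest": "F0", "value": 9.9, "ready": True},  # wrong path
])
results = [commit(rob, regs, mem) for _ in range(4)]
```

In this run the wrong-path write to F0 never reaches the register file, because the flush at the mispredicted branch removes its entry before it can reach the head.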


Once an instruction commits, its entry in the ROB is reclaimed and the register 
or memory destination is updated, eliminating the need for the ROB entry. If 
the ROB fills, we simply stop issuing instructions until an entry is made free. 
Now, let's examine how this scheme would work with the same example we used 
for Tomasulo's algorithm. 


Example Assume the same latencies for the floating-point functional units as in earlier 
examples: add is 2 clock cycles, multiply is 6 clock cycles, and divide is 12 clock cycles. 
Using the code segment below, the same one we used to generate Figure 2.11, show 
what the status tables look like when the MUL.D is ready to go to commit. 


L.D    F6,32(R2) 
L.D    F2,44(R3) 
MUL.D  F0,F2,F4 
SUB.D  F8,F6,F2 
DIV.D  F10,F0,F6 
ADD.D  F6,F8,F2 


Answer Figure 2.15 shows the result in the three tables. Notice that although the SUB.D 
instruction has completed execution, it does not commit until the MUL.D commits. 
The reservation stations and register status field contain the same basic information 
that they did for Tomasulo's algorithm (see page 97 for a description of those 
fields). The differences are that reservation station numbers are replaced with 
ROB entry numbers in the Qj and Qk fields, as well as in the register status fields, 
and we have added the Dest field to the reservation stations. The Dest field 
designates the ROB entry that is the destination for the result produced by this 
reservation station entry. 


The above example illustrates the key difference between a processor 
with speculation and a processor with dynamic scheduling. Compare the content 
of Figure 2.15 with that of Figure 2.11 on page 100, which shows the same 
code sequence in operation on a processor with Tomasulo's algorithm. The key 
difference is that, in the example above, no instruction after the earliest 
uncompleted instruction (MUL.D above) is allowed to complete. In contrast, in 
Figure 2.11 the SUB.D and ADD.D instructions have also completed. 

One implication of this difference is that the processor with the ROB can 
dynamically execute code while maintaining a precise interrupt model. For 
example, if the MUL.D instruction caused an interrupt, we could simply wait until 
it reached the head of the ROB and take the interrupt, flushing any other pending 
instructions from the ROB. Because instruction commit happens in order, this 
yields a precise exception. 

By contrast, in the example using Tomasulo's algorithm, the SUB.D and 
ADD.D instructions could both complete before the MUL.D raised the exception. 
The result is that the registers F8 and F6 (destinations of the SUB.D and ADD.D 
instructions) could be overwritten, and the interrupt would be imprecise. 

Some users and architects have decided that imprecise floating-point excep- 
tions are acceptable in high-performance processors, since the program will 
likely terminate; see Appendix G for further discussion of this topic. Other types 
of exceptions, such as page faults, are much more difficult to accommodate if 
they are imprecise, since the program must transparently resume execution after 
handling such an exception. 

The use of a ROB with in-order instruction commit provides precise excep- 
tions, in addition to supporting speculative execution, as the next example shows. 


Example Consider the code example used earlier for Tomasulo's algorithm and shown in 
Figure 2.13 in execution: 


Loop: L.D    F0,0(R1) 
      MUL.D  F4,F0,F2 
      S.D    F4,0(R1) 
      DADDIU R1,R1,#-8 
      BNE    R1,R2,Loop ;branches if R1≠R2 


Assume that we have issued all the instructions in the loop twice. Let's also 
assume that the L.D and MUL.D from the first iteration have committed and all 
other instructions have completed execution. Normally, the store would wait in 
the ROB for both the effective address operand (R1 in this example) and the 
value (F4 in this example). Since we are only considering the floating-point 
pipeline, assume the effective address for the store is computed by the time the 
instruction is issued. 


Reorder buffer 
Entry  Busy  Instruction        State         Destination  Value 
1      no    L.D   F6,32(R2)    Commit        F6           Mem[32 + Regs[R2]] 
2      no    L.D   F2,44(R3)    Commit        F2           Mem[44 + Regs[R3]] 
3      yes   MUL.D F0,F2,F4     Write result  F0           #2 × Regs[F4] 
4      yes   SUB.D F8,F2,F6     Write result  F8           #2 - #1 
5      yes   DIV.D F10,F0,F6    Execute       F10 
6      yes   ADD.D F6,F8,F2     Write result  F6           #4 + #2 

Reservation stations 
Name   Busy  Op     Vj                   Vk                   Qj   Qk   Dest  A 
Load1  no 
Load2  no 
Add1   no 
Add2   no 
Add3   no 
Mult1  no    MUL.D  Mem[44 + Regs[R3]]   Regs[F4]                       #3 
Mult2  yes   DIV.D                       Mem[32 + Regs[R2]]   #3        #5 

FP register status 
Field      F0   F1   F2   F3   F4   F5   F6   F7   F8   F10 
Reorder #  3                             6         4    5 
Busy       yes  no   no   no   no   no   yes       yes  yes 

Figure 2.15 At the time the MUL.D is ready to commit, only the two L.D instructions have committed, although 
several others have completed execution. The MUL.D is at the head of the ROB, and the two L.D instructions are 
there only to ease understanding. The SUB.D and ADD.D instructions will not commit until the MUL.D instruction 
commits, although the results of the instructions are available and can be used as sources for other instructions. 
The DIV.D is in execution, but has not completed solely due to its longer latency than MUL.D. The Value column 
indicates the value being held; the format #X is used to refer to a value field of ROB entry X. Reorder buffers 1 and 
2 are actually completed, but are shown for informational purposes. We do not show the entries for the load-store 
queue, but these entries are kept in order. 


Answer Figure 2.16 shows the result in two tables. 
Reorder buffer 
Entry  Busy  Instruction          State         Destination    Value 
1      no    L.D    F0,0(R1)      Commit        F0             Mem[0 + Regs[R1]] 
2      no    MUL.D  F4,F0,F2      Commit        F4             #1 × Regs[F2] 
3      yes   S.D    F4,0(R1)      Write result  0 + Regs[R1]   #2 
4      yes   DADDIU R1,R1,#-8     Write result  R1             Regs[R1] - 8 
5      yes   BNE    R1,R2,Loop    Write result 
6      yes   L.D    F0,0(R1)      Write result  F0             Mem[#4] 
7      yes   MUL.D  F4,F0,F2      Write result  F4             #6 × Regs[F2] 
8      yes   S.D    F4,0(R1)      Write result  0 + #4         #7 
9      yes   DADDIU R1,R1,#-8     Write result  R1             #4 - 8 
10     yes   BNE    R1,R2,Loop    Write result 

FP register status 
Field      F0   F1   F2   F3   F4   F5   F6   F7   F8 
Reorder #  6                   7 
Busy       yes  no   no   no   yes  no   no        no 

Figure 2.16 Only the L.D and MUL.D instructions have committed, although all the others have completed 
execution. Hence, no reservation stations are busy and none are shown. The remaining instructions will be committed 
as fast as possible. The first two reorder buffers are empty, but are shown for completeness. 


Because neither the register values nor any memory values are actually writ- 
ten until an instruction commits, the processor can easily undo its speculative 
actions when a branch is found to be mispredicted. Suppose that the branch BNE 
is not taken the first time in Figure 2.16. The instructions prior to the branch will 
simply commit when each reaches the head of the ROB; when the branch reaches 
the head of that buffer, the buffer is simply cleared and the processor begins 
fetching instructions from the other path. 

In practice, processors that speculate try to recover as early as possible after a 
branch is mispredicted. This recovery can be done by clearing the ROB for all 
entries that appear after the mispredicted branch, allowing those that are before 
the branch in the ROB to continue, and restarting the fetch at the correct branch 
successor. In speculative processors, performance is more sensitive to the branch 
prediction, since the impact of a misprediction will be higher. Thus, all the 
aspects of handling branches—prediction accuracy, latency of misprediction 
detection, and misprediction recovery time—increase in importance. 
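
The early-recovery policy just described, in which only the entries younger than the mispredicted branch are discarded, can be sketched as follows. This is an illustrative model only: the ROB is assumed to be a Python list in program order, the function name `squash_after` is hypothetical, and the sketch ignores the register status table and reservation stations, which real hardware would also have to repair.

```python
def squash_after(rob, branch_pos):
    """Drop every entry younger than the branch at index branch_pos.

    Entries up to and including the branch stay in the ROB and continue
    toward commit. Returns the list of squashed entries.
    """
    squashed = rob[branch_pos + 1:]
    del rob[branch_pos + 1:]
    return squashed


# Uncommitted entries 3 through 10 of Figure 2.16, oldest first.
rob = ["S.D #3", "DADDIU #4", "BNE #5",
       "L.D #6", "MUL.D #7", "S.D #8", "DADDIU #9", "BNE #10"]

# Suppose the first BNE (entry 5) turns out to be mispredicted.
squashed = squash_after(rob, rob.index("BNE #5"))
```

After the call, entries 3 to 5 remain and will commit in order, while the whole second iteration is gone and fetch would restart at the correct successor of the branch.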

Exceptions are handled by not recognizing the exception until it is ready to 
commit. If a speculated instruction raises an exception, the exception is recorded 
in the ROB. If a branch misprediction arises and the instruction should not have 
been executed, the exception is flushed along with the instruction when the ROB 
is cleared. If the instruction reaches the head of the ROB, then we know it is no 
longer speculative and the exception should really be taken. We can also try to 
handle exceptions as soon as they arise and all earlier branches are resolved, but 
this is more challenging in the case of exceptions than for branch mispredictions and, 
because it occurs less frequently, not as critical. 

Figure 2.17 shows the steps of execution for an instruction, as well as the 
conditions that must be satisfied to proceed to the step and the actions taken. We 
show the case where mispredicted branches are not resolved until commit. 
Although speculation seems like a simple addition to dynamic scheduling, a 
comparison of Figure 2.17 with the comparable figure for Tomasulo's algorithm 
in Figure 2.12 shows that speculation adds significant complications to the con- 
trol. In addition, remember that branch mispredictions are somewhat more com- 
plex as well. 

There is an important difference in how stores are handled in a speculative 
processor versus in Tomasulo's algorithm. In Tomasulo's algorithm, a store can 
update memory when it reaches Write Result (which ensures that the effective 
address has been calculated) and the data value to store is available. In a specula- 
tive processor, a store updates memory only when it reaches the head of the ROB. 
This difference ensures that memory is not updated until an instruction is no 
longer speculative. 

Figure 2.17 has one significant simplification for stores, which is unneeded in 
practice. Figure 2.17 requires stores to wait in the Write Result stage for the reg- 
ister source operand whose value is to be stored; the value is then moved from the 
Vk field of the store's reservation station to the Value field of the store's ROB 
entry. In reality, however, the value to be stored need not arrive until just before 
the store commits and can be placed directly into the store's ROB entry by the 
sourcing instruction. This is accomplished by having the hardware track when the 
source value to be stored is available in the store's ROB entry and searching the 
ROB on every instruction completion to look for dependent stores. 

This addition is not complicated, but adding it has two effects: We would 
need to add a field to the ROB, and Figure 2.17, which is already in a small font, 
would be even longer! Although Figure 2.17 makes this simplification, in our 
examples, we will allow the store to pass through the Write Result stage and sim- 
ply wait for the value to be ready when it commits. 

Status: Issue, all instructions 
Wait until: Reservation station (r) and ROB (b) both available 
Action or bookkeeping: 
    if (RegisterStat[rs].Busy) /* in-flight instr. writes rs */ 
        {h ← RegisterStat[rs].Reorder; 
         if (ROB[h].Ready) /* Instr completed already */ 
             {RS[r].Vj ← ROB[h].Value; RS[r].Qj ← 0;} 
         else {RS[r].Qj ← h;} /* wait for instruction */ 
        } else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0;}; 
    RS[r].Busy ← yes; RS[r].Dest ← b; 
    ROB[b].Instruction ← opcode; ROB[b].Dest ← rd; ROB[b].Ready ← no; 

Status: Issue, FP operations and stores 
Action or bookkeeping: 
    if (RegisterStat[rt].Busy) /* in-flight instr writes rt */ 
        {h ← RegisterStat[rt].Reorder; 
         if (ROB[h].Ready) /* Instr completed already */ 
             {RS[r].Vk ← ROB[h].Value; RS[r].Qk ← 0;} 
         else {RS[r].Qk ← h;} /* wait for instruction */ 
        } else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0;}; 

Status: Issue, FP operations 
Action or bookkeeping: 
    RegisterStat[rd].Reorder ← b; RegisterStat[rd].Busy ← yes; 
    ROB[b].Dest ← rd; 

Status: Issue, loads 
Action or bookkeeping: 
    RS[r].A ← imm; RegisterStat[rt].Reorder ← b; 
    RegisterStat[rt].Busy ← yes; ROB[b].Dest ← rt; 

Status: Issue, stores 
Action or bookkeeping: 
    RS[r].A ← imm; 

Status: Execute, FP op 
Wait until: (RS[r].Qj == 0) and (RS[r].Qk == 0) 
Action or bookkeeping: 
    Compute results; operands are in Vj and Vk 

Status: Execute, load step 1 
Wait until: (RS[r].Qj == 0) and there are no stores earlier in the queue 
Action or bookkeeping: 
    RS[r].A ← RS[r].Vj + RS[r].A; 

Status: Execute, load step 2 
Wait until: Load step 1 done and all stores earlier in ROB have different address 
Action or bookkeeping: 
    Read from Mem[RS[r].A] 

Status: Execute, store 
Wait until: (RS[r].Qj == 0) and store at queue head 
Action or bookkeeping: 
    ROB[h].Address ← RS[r].Vj + RS[r].A; 

Status: Write result, all but store 
Wait until: Execution done at r and CDB available 
Action or bookkeeping: 
    b ← RS[r].Dest; RS[r].Busy ← no; 
    ∀x(if (RS[x].Qj==b) {RS[x].Vj ← result; RS[x].Qj ← 0}); 
    ∀x(if (RS[x].Qk==b) {RS[x].Vk ← result; RS[x].Qk ← 0}); 
    ROB[b].Value ← result; ROB[b].Ready ← yes; 

Status: Write result, store 
Wait until: Execution done at r and (RS[r].Qk == 0) 
Action or bookkeeping: 
    ROB[h].Value ← RS[r].Vk; 

Status: Commit 
Wait until: Instruction is at the head of the ROB (entry h) and ROB[h].ready == yes 
Action or bookkeeping: 
    d ← ROB[h].Dest; /* register dest, if exists */ 
    if (ROB[h].Instruction==Branch) 
        {if (branch is mispredicted) 
            {clear ROB[h], RegisterStat; fetch branch dest;};} 
    else if (ROB[h].Instruction==Store) 
        {Mem[ROB[h].Destination] ← ROB[h].Value;} 
    else /* put the result in the register destination */ 
        {Regs[d] ← ROB[h].Value;}; 
    ROB[h].Busy ← no; /* free up ROB entry */ 
    /* free up dest register if no one else writing it */ 
    if (RegisterStat[d].Reorder==h) {RegisterStat[d].Busy ← no;}; 

Figure 2.17 Steps in the algorithm and what is required for each step. For the issuing instruction, rd is the 
destination, rs and rt are the sources, r is the reservation station allocated, b is the assigned ROB entry, and h is the 
head entry of the ROB. RS is the reservation station data structure. The value returned by a reservation station is 
called the result. RegisterStat is the register data structure, Regs represents the actual registers, and ROB is the 
reorder buffer data structure. 


As in Tomasulo's algorithm, we must avoid hazards through memory. WAW 
and WAR hazards through memory are eliminated with speculation because the 
actual updating of memory occurs in order, when a store is at the head of the 
ROB, and hence, no earlier loads or stores can still be pending. RAW hazards 
through memory are maintained by two restrictions: 


1. not allowing a load to initiate the second step of its execution if any active 
ROB entry occupied by a store has a Destination field that matches the value 
of the A field of the load, and 


2. maintaining the program order for the computation of an effective address of 
a load with respect to all earlier stores. 


Together, these two restrictions ensure that any load that accesses a memory loca- 
tion written to by an earlier store cannot perform the memory access until the 
store has written the data. Some speculative processors will actually bypass the 
value from the store to the load directly, when such a RAW hazard occurs. 
Another approach is to predict potential collisions using a form of value predic- 
tion; we consider this in Section 2.9. 
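
The two restrictions can be sketched as a single check performed before a load starts its memory access. This is a minimal illustration under stated assumptions: `earlier_stores` stands for the uncommitted stores that precede the load in the ROB, each represented by a dictionary whose `addr` key is `None` until the store's effective address has been computed. The function name and data layout are hypothetical, and the sketch simply waits on a match rather than bypassing the store value.

```python
def load_may_access_memory(load_addr, earlier_stores):
    """Return True if the load may begin the second step of its execution."""
    for store in earlier_stores:
        if store["addr"] is None:
            # Restriction 2: effective addresses are computed in program
            # order, so the load waits behind a store with no address yet.
            return False
        if store["addr"] == load_addr:
            # Restriction 1: an active store in the ROB has a matching
            # Destination field, so the load waits until it has written.
            return False
    return True


no_conflict = load_may_access_memory(1000, [{"addr": 992}])
same_address = load_may_access_memory(1000, [{"addr": 992}, {"addr": 1000}])
unknown_address = load_may_access_memory(1000, [{"addr": None}])
```

A processor that bypasses values, as mentioned above, would replace the second `return False` with a forward of the pending store's value to the load.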

Although this explanation of speculative execution has focused on floating 
point, the techniques easily extend to the integer registers and functional units, as 
we will see in the "Putting It All Together" section. Indeed, speculation may be 
more useful in integer programs, since such programs tend to have code where 
the branch behavior is less predictable. Additionally, these techniques can be 
extended to work in a multiple-issue processor by allowing multiple instructions 
to issue and commit every clock. In fact, speculation is probably most interesting 
in such processors, since less ambitious techniques can probably exploit suffi- 
cient ILP within basic blocks when assisted by a compiler. 


2.7 Exploiting ILP Using Multiple Issue and Static Scheduling 


The techniques of the preceding sections can be used to eliminate data and con- 
trol stalls and achieve an ideal CPI of one. To improve performance further we 
would like to decrease the CPI to less than one. But the CPI cannot be reduced 
below one if we issue only one instruction every clock cycle. 

The goal of the multiple-issue processors, discussed in the next few sections, 
is to allow multiple instructions to issue in a clock cycle. Multiple-issue proces- 
sors come in three major flavors: 


1. statically scheduled superscalar processors, 
2. VLIW (very long instruction word) processors, and 
3. dynamically scheduled superscalar processors. 


The two types of superscalar processors issue varying numbers of instructions 
per clock and use in-order execution if they are statically scheduled or out-of- 
order execution if they are dynamically scheduled. 

VLIW processors, in contrast, issue a fixed number of instructions formatted 
either as one large instruction or as a fixed instruction packet with the parallel- 
ism among instructions explicitly indicated by the instruction. VLIW processors 
are inherently statically scheduled by the compiler. When Intel and HP created 
the IA-64 architecture, described in Appendix G, they also introduced the name 
EPIC—explicitly parallel instruction computer—for this architectural style. 
Common name      Issue structure   Hazard detection    Scheduling         Distinguishing characteristic    Examples 
Superscalar      dynamic           hardware            static             in-order execution               mostly in the embedded space: 
(static)                                                                                                   MIPS and ARM 
Superscalar      dynamic           hardware            dynamic            some out-of-order execution,     none at the present 
(dynamic)                                                                 but no speculation 
Superscalar      dynamic           hardware            dynamic with       out-of-order execution with      Pentium 4, MIPS R12K, 
(speculative)                                          speculation        speculation                      IBM Power5 
VLIW/LIW         static            primarily software  static             all hazards determined and       most examples are in the 
                                                                          indicated by compiler            embedded space, such as 
                                                                          (often implicitly)               the TI C6x 
EPIC             primarily static  primarily software  mostly static      all hazards determined and       Itanium 
                                                                          indicated explicitly by the 
                                                                          compiler 

Figure 2.18 The five primary approaches in use for multiple-issue processors and the primary characteristics 
that distinguish them. This chapter has focused on the hardware-intensive techniques, which are all some form of 
superscalar. Appendix G focuses on compiler-based approaches. The EPIC approach, as embodied in the IA-64 
architecture, extends many of the concepts of the early VLIW approaches, providing a blend of static and dynamic 
approaches. 


Although statically scheduled superscalars issue a varying rather than a fixed 
number of instructions per clock, they are actually closer in concept to VLIWs, 
since both approaches rely on the compiler to schedule code for the processor. 
Because of the diminishing advantages of a statically scheduled superscalar as the 
issue width grows, statically scheduled superscalars are used primarily for narrow 
issue widths, normally just two instructions. Beyond that width, most designers 
choose to implement either a VLIW or a dynamically scheduled superscalar. 
Because of the similarities in hardware and required compiler technology, we 
focus on VLIWs in this section. The insights of this section are easily extrapolated 
to a statically scheduled superscalar. 

Figure 2.18 summarizes the basic approaches to multiple issue and their dis- 
tinguishing characteristics and shows processors that use each approach. 


The Basic VLIW Approach 


VLIWs use multiple, independent functional units. Rather than attempting to 
issue multiple, independent instructions to the units, a VLIW packages the multi- 
ple operations into one very long instruction, or requires that the instructions in 
the issue packet satisfy the same constraints. Since there is no fundamental 
difference in the two approaches, we will just assume that multiple operations are 
placed in one instruction, as in the original VLIW approach. 

Since this advantage of a VLIW increases as the maximum issue rate grows, 
we focus on a wider-issue processor. Indeed, for simple two-issue processors, the 
overhead of a superscalar is probably minimal. Many designers would probably 
argue that a four-issue processor has manageable overhead, but as we will see in 
the next chapter, the growth in overhead is a major factor limiting wider-issue 
processors. 

Let's consider a VLIW processor with instructions that contain five opera- 
tions, including one integer operation (which could also be a branch), two float- 
ing-point operations, and two memory references. The instruction would have a 
set of fields for each functional unit—perhaps 16-24 bits per unit, yielding an 
instruction length of between 80 and 120 bits. By comparison, the Intel Itanium 1 
and 2 contain 6 operations per instruction packet. 

To keep the functional units busy, there must be enough parallelism in a code 
sequence to fill the available operation slots. This parallelism is uncovered by 
unrolling loops and scheduling the code within the single larger loop body. If the 
unrolling generates straight-line code, then local scheduling techniques, which 
operate on a single basic block, can be used. If finding and exploiting the paral- 
lelism requires scheduling code across branches, a substantially more complex 
global scheduling algorithm must be used. Global scheduling algorithms are not 
only more complex in structure, but they also must deal with significantly more 
complicated trade-offs in optimization, since moving code across branches is 
expensive. 

In Appendix G, we will discuss trace scheduling, one of these global schedul- 
ing techniques developed specifically for VLIWs; we will also explore special 
hardware support that allows some conditional branches to be eliminated, extend- 
ing the usefulness of local scheduling and enhancing the performance of global 
scheduling. 

For now, we will rely on loop unrolling to generate long, straight-line code 
sequences, so that we can use local scheduling to build up VLIW instructions and 
focus on how well these processors operate. 


Example Suppose we have a VLIW that could issue two memory references, two FP
operations, and one integer operation or branch in every clock cycle. Show an
unrolled version of the loop x[i] = x[i] + s (see page 76 for the MIPS code) for
such a processor. Unroll as many times as necessary to eliminate any stalls.
Ignore delayed branches.


Answer Figure 2.19 shows the code. The loop has been unrolled to make seven copies of
the body, which eliminates all stalls (i.e., completely empty issue cycles), and 
runs in 9 cycles. This code yields a running rate of seven results in 9 cycles, or 
1.29 cycles per result, nearly twice as fast as the two-issue superscalar of Section 
2.2 that used unrolled and scheduled code.

Memory            Memory            FP                 FP                 Integer
reference 1       reference 2       operation 1        operation 2        operation/branch
L.D F0,0(R1)      L.D F6,-8(R1)     --                 --                 --
L.D F10,-16(R1)   L.D F14,-24(R1)   --                 --                 --
L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2     --
L.D F26,-48(R1)   --                ADD.D F12,F10,F2   ADD.D F16,F14,F2   --
--                --                ADD.D F20,F18,F2   ADD.D F24,F22,F2   --
S.D F4,0(R1)      S.D F8,-8(R1)     ADD.D F28,F26,F2   --                 --
S.D F12,-16(R1)   S.D F16,-24(R1)   --                 --                 DADDUI R1,R1,#-56
S.D F20,24(R1)    S.D F24,16(R1)    --                 --                 --
S.D F28,8(R1)     --                --                 --                 BNE R1,R2,Loop





Figure 2.19 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes 9
cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23
operations in 9 clock cycles, or about 2.5 operations per cycle. The efficiency, the percentage of available slots that
contained an operation, is about 51% (23 of the 45 slots in nine five-operation instructions). To achieve this issue rate
requires a larger number of registers than MIPS would normally use in this loop. The VLIW code sequence above requires
at least eight FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers
or as many as five when unrolled and scheduled.
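The figures quoted for this schedule can be checked by counting slots directly. The following is a minimal sketch, assuming the schedule of Figure 2.19 is transcribed as one tuple per VLIW instruction with None marking an empty slot; the names SCHEDULE and schedule_stats are illustrative and not part of the original text.

```python
# Figure 2.19 transcribed as a list of VLIW instructions (one per clock cycle).
# Slot order: memory ref 1, memory ref 2, FP op 1, FP op 2, integer op/branch.
# None marks an empty slot. The data structure itself is an illustrative assumption.
SCHEDULE = [
    ("L.D F0,0(R1)",    "L.D F6,-8(R1)",   None,               None,               None),
    ("L.D F10,-16(R1)", "L.D F14,-24(R1)", None,               None,               None),
    ("L.D F18,-32(R1)", "L.D F22,-40(R1)", "ADD.D F4,F0,F2",   "ADD.D F8,F6,F2",   None),
    ("L.D F26,-48(R1)", None,              "ADD.D F12,F10,F2", "ADD.D F16,F14,F2", None),
    (None,              None,              "ADD.D F20,F18,F2", "ADD.D F24,F22,F2", None),
    ("S.D F4,0(R1)",    "S.D F8,-8(R1)",   "ADD.D F28,F26,F2", None,               None),
    ("S.D F12,-16(R1)", "S.D F16,-24(R1)", None,               None,               "DADDUI R1,R1,#-56"),
    ("S.D F20,24(R1)",  "S.D F24,16(R1)",  None,               None,               None),
    ("S.D F28,8(R1)",   None,              None,               None,               "BNE R1,R2,Loop"),
]


def schedule_stats(schedule, results_per_pass):
    """Count cycles, operations, and slot usage for a VLIW schedule."""
    cycles = len(schedule)                          # one instruction issues per cycle
    slots = sum(len(word) for word in schedule)     # total operation slots available
    ops = sum(1 for word in schedule for op in word if op is not None)
    return {
        "cycles": cycles,
        "operations": ops,
        "ops_per_cycle": ops / cycles,
        "slot_utilization": ops / slots,
        "cycles_per_result": cycles / results_per_pass,
    }


# Seven copies of the loop body produce seven results per pass through the loop.
stats = schedule_stats(SCHEDULE, results_per_pass=7)
print(stats)
```

Counting from the schedule rather than from the prose makes it easy to see how the utilization would change if, for example, the loop were unrolled a different number of times.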


For the original VLIW model, there were both technical and logistical prob- 
lems that make the approach less efficient. The technical problems are the 
increase in code size and the limitations of lockstep operation. Two different ele- 
ments combine to increase code size substantially for a VLIW. First, generating 
enough operations in a straight-line code fragment requires ambitiously unrolling 
loops (as in earlier examples), thereby increasing code size. Second, whenever 
instructions are not full, the unused functional units translate to wasted bits in the 
instruction encoding. In Appendix G, we examine software scheduling 
approaches, such as software pipelining, that can achieve the benefits of unrolling 
without as much code expansion. 

To combat this code size increase, clever encodings are sometimes used. 
For example, there may be only one large immediate field for use by any func- 
tional unit. Another technique is to compress the instructions in main memory 
and expand them when they are read into the cache or are decoded. In Appen- 
dix G, we show other techniques, as well as document the significant code 
expansion seen on IA-64. 

Early VLIWs operated in lockstep; there was no hazard detection hardware at 
all. This structure dictated that a stall in any functional unit pipeline must cause 
the entire processor to stall, since all the functional units must be kept synchro- 
nized. Although a compiler may be able to schedule the deterministic functional 
units to prevent stalls, predicting which data accesses will encounter a cache stall 
and scheduling them is very difficult. Hence, caches needed to be blocking and to 
cause all the functional units to stall. As the issue rate and number of memory
references become large, this synchronization restriction becomes unacceptable.
In more recent processors, the functional units operate more independently, and
the compiler is used to avoid hazards at issue time, while hardware checks allow
for unsynchronized execution once instructions are issued.

Binary code compatibility has also been a major logistical problem for 
VLIWs. In a strict VLIW approach, the code sequence makes use of both the 
instruction set definition and the detailed pipeline structure, including both func- 
tional units and their latencies. Thus, different numbers of functional units and 
unit latencies require different versions of the code. This requirement makes 
migrating between successive implementations, or between implementations 
with different issue widths, more difficult than it is for a superscalar design. Of 
course, obtaining improved performance from a new superscalar design may 
require recompilation. Nonetheless, the ability to run old binary files is a practi- 
cal advantage for the superscalar approach. 

The EPIC approach, of which the IA-64 architecture is the primary example, 
provides solutions to many of the problems encountered in early VLIW designs, 
including extensions for more aggressive software speculation and methods to 
overcome the limitation of hardware dependence while preserving binary com- 
patibility. 

The major challenge for all multiple-issue processors is to try to exploit large 
amounts of ILP. When the parallelism comes from unrolling simple loops in FP 
programs, the original loop probably could have been run efficiently on a vector 
processor (described in Appendix F). It is not clear that a multiple-issue proces- 
sor is preferred over a vector processor for such applications; the costs are simi- 
lar, and the vector processor is typically the same speed or faster. The potential 
advantages of a multiple-issue processor versus a vector processor are its ability
to extract some parallelism from less structured code and its ability to easily
cache all forms of data. For these reasons multiple-issue approaches have become
the primary method for taking advantage of instruction-level parallelism, and 
vectors have become primarily an extension to these processors. 


Exploiting ILP Using Dynamic Scheduling, Multiple 
Issue, and Speculation 


So far, we have seen how the individual mechanisms of dynamic scheduling, 
multiple issue, and speculation work. In this section, we put all three together, 
which yields a microarchitecture quite similar to those in modern microproces- 
sors. For simplicity, we consider only an issue rate of two instructions per clock, 
but the concepts are no different from modern processors that issue three or more 
instructions per clock. 

Let's assume we want to extend Tomasulo's algorithm to support a two-issue 
superscalar pipeline with a separate integer and floating-point unit, each of which 
can initiate an operation on every clock. We do not want to issue instructions to
the reservation stations out of order, since this could lead to a violation of the pro- 
gram semantics. To gain the full advantage of dynamic scheduling we will allow 
the pipeline to issue any combination of two instructions in a clock, using the 
scheduling hardware to actually assign operations to the integer and floating- 
point unit. Because the interaction of the integer and floating-point instructions is 
crucial, we also extend Tomasulo's scheme to deal with both the integer and 
floating-point functional units and registers, as well as incorporating speculative 
execution. 

Two different approaches have been used to issue multiple instructions per 
clock in a dynamically scheduled processor, and both rely on the observation that 
the key is assigning a reservation station and updating the pipeline control tables. 
One approach is to run this step in half a clock cycle, so that two instructions can 
be processed in one clock cycle. A second alternative is to build the logic neces- 
sary to handle two instructions at once, including any possible dependences 
between the instructions. Modern superscalar processors that issue four or more 
instructions per clock often include both approaches: They both pipeline and 
widen the issue logic. 

Putting together speculative dynamic scheduling with multiple issue requires 
overcoming one additional challenge at the back end of the pipeline: we must be 
able to complete and commit multiple instructions per clock. Like the challenge 
of issuing multiple instructions, the concepts are simple, although the implemen- 
tation may be challenging in the same manner as the issue and register renaming 
process. We can show how the concepts fit together with an example. 


Example Consider the execution of the following loop, which increments each element of
an integer array, on a two-issue processor, once without speculation and once 
with speculation: 


Loop:   LD      R2,0(R1)     ;R2=array element
        DADDIU  R2,R2,#1     ;increment R2
        SD      R2,0(R1)     ;store result
        DADDIU  R1,R1,#8     ;increment pointer
        BNE     R2,R3,LOOP   ;branch if not last element


Assume that there are separate integer functional units for effective address 
calculation, for ALU operations, and for branch condition evaluation. Create a 
table for the first three iterations of this loop for both processors. Assume that up 
to two instructions of any type can commit per clock. 


Answer Figures 2.20 and 2.21 show the performance for a two-issue dynamically
scheduled processor, without and with speculation. In this case, where a branch can be
a critical performance limiter, speculation helps significantly. The third branch in
the speculative processor executes in clock cycle 13, while it executes in clock
cycle 19 on the nonspeculative pipeline. Because the completion rate on the
nonspeculative pipeline is falling behind the issue rate rapidly, the nonspeculative
pipeline will stall when a few more iterations are issued. The performance of the
nonspeculative processor could be improved by allowing load instructions to
complete effective address calculation before a branch is decided, but unless
speculative memory accesses are allowed, this improvement will gain only 1
clock per iteration.

Iteration                      Issues at    Executes at  Memory access   Write CDB at
number     Instructions        clock cycle  clock cycle  at clock cycle  clock cycle   Comment
1          LD     R2,0(R1)     1            2            3               4             First issue
1          DADDIU R2,R2,#1     1            5            --              6             Wait for LD
1          SD     R2,0(R1)     2            3            7               --            Wait for DADDIU
1          DADDIU R1,R1,#8     2            3            --              4             Execute directly
1          BNE    R2,R3,LOOP   3            7            --              --            Wait for DADDIU
2          LD     R2,0(R1)     4            8            9               10            Wait for BNE
2          DADDIU R2,R2,#1     4            11           --              12            Wait for LD
2          SD     R2,0(R1)     5            9            13              --            Wait for DADDIU
2          DADDIU R1,R1,#8     5            8            --              9             Wait for BNE
2          BNE    R2,R3,LOOP   6            13           --              --            Wait for DADDIU
3          LD     R2,0(R1)     7            14           15              16            Wait for BNE
3          DADDIU R2,R2,#1     7            17           --              18            Wait for LD
3          SD     R2,0(R1)     8            15           19              --            Wait for DADDIU
3          DADDIU R1,R1,#8     8            14           --              15            Wait for BNE
3          BNE    R2,R3,LOOP   9            19           --              --            Wait for DADDIU

Figure 2.20 The time of issue, execution, and writing result for a dual-issue version of our pipeline without
speculation. Note that the LD following the BNE cannot start execution earlier because it must wait until the branch
outcome is determined. This type of program, with data-dependent branches that cannot be resolved earlier, shows
the strength of speculation. Separate functional units for address calculation, ALU operations, and branch-condition
evaluation allow multiple instructions to execute in the same cycle. Figure 2.21 shows this example with speculation.


This example clearly shows how speculation can be advantageous when there 
are data-dependent branches, which otherwise would limit performance. This 
advantage depends, however, on accurate branch prediction. Incorrect speculation
will not improve performance, but will, in fact, typically harm performance.

Iteration                      Issues   Executes  Read access  Write CDB  Commits
number     Instructions        at clock at clock  at clock     at clock   at clock  Comment
1          LD     R2,0(R1)     1        2         3            4          5         First issue
1          DADDIU R2,R2,#1     1        5         --           6          7         Wait for LD
1          SD     R2,0(R1)     2        3         --           --         7         Wait for DADDIU
1          DADDIU R1,R1,#8     2        3         --           4          8         Commit in order
1          BNE    R2,R3,LOOP   3        7         --           --         8         Wait for DADDIU
2          LD     R2,0(R1)     4        5         6            7          9         No execute delay
2          DADDIU R2,R2,#1     4        8         --           9          10        Wait for LD
2          SD     R2,0(R1)     5        6         --           --         10        Wait for DADDIU
2          DADDIU R1,R1,#8     5        6         --           7          11        Commit in order
2          BNE    R2,R3,LOOP   6        10        --           --         11        Wait for DADDIU
3          LD     R2,0(R1)     7        8         9            10         12        Earliest possible
3          DADDIU R2,R2,#1     7        11        --           12         13        Wait for LD
3          SD     R2,0(R1)     8        9         --           --         13        Wait for DADDIU
3          DADDIU R1,R1,#8     8        9         --           10         14        Executes earlier
3          BNE    R2,R3,LOOP   9        13        --           --         14        Wait for DADDIU





Figure 2.21 The time of issue, execution, and writing result for a dual-issue version of our pipeline with
speculation. Note that the LD following the BNE can start execution early because it is speculative.
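The gap between the two pipelines can be summarized by how often a loop iteration finishes its branch. The sketch below takes the clock cycles in which each BNE executes, as read from Figures 2.20 and 2.21, and compares the resulting iteration interval with the interval at which iterations are issued. The list and function names are illustrative assumptions.

```python
# Clock cycle in which the BNE of iterations 1..3 executes (Figures 2.20 and 2.21).
BNE_EXECUTE_NO_SPECULATION = [7, 13, 19]
BNE_EXECUTE_SPECULATION = [7, 10, 13]

# Clock cycle in which the LD that starts iterations 1..3 issues (both figures).
LD_ISSUE = [1, 4, 7]


def cycles_per_iteration(event_cycles):
    """Average distance, in clocks, between the same event in successive iterations."""
    gaps = [later - earlier for earlier, later in zip(event_cycles, event_cycles[1:])]
    return sum(gaps) / len(gaps)


issue_interval = cycles_per_iteration(LD_ISSUE)
no_spec_interval = cycles_per_iteration(BNE_EXECUTE_NO_SPECULATION)
spec_interval = cycles_per_iteration(BNE_EXECUTE_SPECULATION)

print("issue interval:            ", issue_interval)
print("branch interval, no spec.: ", no_spec_interval)
print("branch interval, with spec:", spec_interval)
```

When the branch interval exceeds the issue interval, instructions accumulate in the reservation stations, which is why the nonspeculative pipeline eventually stalls.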


Advanced Techniques for Instruction Delivery 
and Speculation 


In a high-performance pipeline, especially one with multiple issue, predicting 
branches well is not enough; we actually have to be able to deliver a high- 
bandwidth instruction stream. In recent multiple-issue processors, this has meant 
delivering 4-8 instructions every clock cycle. We look at methods for increasing 
instruction delivery bandwidth first. We then turn to a set of key issues in imple- 
menting advanced speculation techniques, including the use of register renaming 
versus reorder buffers, the aggressiveness of speculation, and a technique called 
value prediction, which could further enhance ILP. 


Increasing Instruction Fetch Bandwidth 


A multiple-issue processor will require that the average number of instructions
fetched every clock cycle be at least as large as the average throughput. Of
course, fetching these instructions requires wide enough paths to the instruction
cache, but the most difficult aspect is handling branches. In this section we look
at two methods for dealing with branches and then discuss how modern processors
integrate the instruction prediction and prefetch functions.


Branch-Target Buffers 


To reduce the branch penalty for our simple five-stage pipeline, as well as for 
deeper pipelines, we must know whether the as-yet-undecoded instruction is a 
branch and, if so, what the next PC should be. If the instruction is a branch and 
we know what the next PC should be, we can have a branch penalty of zero. A 
branch-prediction cache that stores the predicted address for the next instruction 
after a branch is called a branch-target buffer or branch-target cache. Figure 2.22 
shows a branch-target buffer. 

Because a branch-target buffer predicts the next instruction address and will 
send it out before decoding the instruction, we must know whether the fetched 
instruction is predicted as a taken branch. If the PC of the fetched instruction 
matches a PC in the prediction buffer, then the corresponding predicted PC is 
used as the next PC. The hardware for this branch-target buffer is essentially 
identical to the hardware for a cache.

Figure 2.22 A branch-target buffer. The PC of the instruction being fetched is matched
against a set of instruction addresses stored in the first column; these represent the
addresses of known branches. If the PC matches one of these entries, then the instruction
being fetched is a taken branch, and the second field, predicted PC, contains the
prediction for the next PC after the branch. Fetching begins immediately at that address. The
third field, which is optional, may be used for extra prediction state bits.

If a matching entry is found in the branch-target buffer, fetching begins
immediately at the predicted PC. Note that unlike a branch-prediction buffer, the 
predictive entry must be matched to this instruction because the predicted PC will 
be sent out before it is known whether this instruction is even a branch. If the pro- 
cessor did not check whether the entry matched this PC, then the wrong PC 
would be sent out for instructions that were not branches, resulting in a slower 
processor. We only need to store the predicted-taken branches in the branch-tar- 
get buffer, since an untaken branch should simply fetch the next sequential 
instruction, as if it were not a branch. 
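The lookup just described can be sketched in a few lines. This is a toy model, not a description of any particular processor: the direct-mapped organization, the 16-entry size, the 4-byte instruction width, and the policy of deleting an entry when its branch turns out not taken are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Toy direct-mapped branch-target buffer (all sizes are illustrative)."""

    def __init__(self, entries=16, instr_bytes=4):
        self.entries = entries
        self.instr_bytes = instr_bytes
        # Each entry is None or a (branch_pc, predicted_pc) pair.
        self.table = [None] * entries

    def _index(self, pc):
        return (pc // self.instr_bytes) % self.entries

    def predict_next_pc(self, pc):
        """Return the predicted next PC for the instruction fetched at pc."""
        entry = self.table[self._index(pc)]
        # The full PC must match: the prediction is used before decode,
        # so a non-branch must never pick up another instruction's target.
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + self.instr_bytes  # miss: fetch the sequential instruction

    def update(self, pc, taken, target=None):
        """Record the resolved outcome; only taken branches are kept."""
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None  # assumed policy: drop a branch that fell through


btb = BranchTargetBuffer()
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.predict_next_pc(0x100)))  # hit: predicted target
print(hex(btb.predict_next_pc(0x140)))  # same index, different PC: sequential fetch
```

The second lookup maps to the same table index as the first but fails the PC comparison, which is exactly the case the matching requirement guards against.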

Figure 2.23 shows the detailed steps when using a branch-target buffer for a 
simple five-stage pipeline. From this we can see that there will be no branch 
delay if a branch-prediction entry is found in the buffer and the prediction is cor- 
rect. Otherwise, there will be a penalty of at least 2 clock cycles. Dealing with the 
mispredictions and misses is a significant challenge, since we typically will have 
to halt instruction fetch while we rewrite the buffer entry. Thus, we would like to 
make this process fast to minimize the penalty. 

To evaluate how well a branch-target buffer works, we first must determine 
the penalties in all possible cases. Figure 2.24 contains this information for the 
simple five-stage pipeline. 


Example Determine the total branch penalty for a branch-target buffer assuming the pen- 
alty cycles for individual mispredictions from Figure 2.24. Make the following 
assumptions about the prediction accuracy and hit rate: 


• Prediction accuracy is 90% (for instructions in the buffer).


• Hit rate in the buffer is 90% (for branches predicted taken).


Answer We compute the penalty by looking at the probability of two events: the branch is 
predicted taken but ends up being not taken, and the branch is taken but is not 
found in the buffer. Both carry a penalty of 2 cycles. 


Probability (branch in buffer, but actually not taken) = Percent buffer hit rate × Percent incorrect predictions
                                                       = 90% × 10% = 0.09
Probability (branch not in buffer, but actually taken) = 10%

Branch penalty = (0.09 + 0.10) × 2
Branch penalty = 0.38
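The arithmetic above is easy to repeat for other accuracies and hit rates. The following sketch wraps it in a small function; the function and parameter names are illustrative, and the 2-cycle penalty is the value taken from Figure 2.24.

```python
def btb_branch_penalty(hit_rate, prediction_accuracy, taken_miss_probability,
                       penalty_cycles=2):
    """Expected penalty cycles per branch for a branch-target buffer.

    hit_rate               -- fraction of branches found in the buffer
    prediction_accuracy    -- fraction of buffer hits that are predicted correctly
    taken_miss_probability -- probability a branch is taken but not in the buffer
    penalty_cycles         -- cost of either event (2 cycles in Figure 2.24)
    """
    in_buffer_but_not_taken = hit_rate * (1.0 - prediction_accuracy)
    return (in_buffer_but_not_taken + taken_miss_probability) * penalty_cycles


penalty = btb_branch_penalty(hit_rate=0.90,
                             prediction_accuracy=0.90,
                             taken_miss_probability=0.10)
print(round(penalty, 2))  # 0.38
```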


This penalty compares with a branch penalty for delayed branches, which we 
evaluate in Appendix A, of about 0.5 clock cycles per branch. Remember, though, 
that the improvement from dynamic branch prediction will grow as the pipeline 
length and, hence, the branch delay grows; in addition, better predictors will 
yield a larger performance advantage.

Figure 2.23 The steps involved in handling an instruction with a branch-target buffer.

Instruction in buffer   Prediction   Actual branch   Penalty cycles
yes                     taken        taken           0
yes                     taken        not taken       2
no                      --           taken           2
no                      --           not taken       0





Figure 2.24 Penalties for all possible combinations of whether the branch is in the 
buffer and what it actually does, assuming we store only taken branches in the 
buffer. There is no branch penalty if everything is correctly predicted and the branch is 
found in the target buffer. If the branch is not correctly predicted, the penalty is equal 
to 1 clock cycle to update the buffer with the correct information (during which an 
instruction cannot be fetched) and 1 clock cycle, if needed, to restart fetching the next 
correct instruction for the branch. If the branch is not found and taken, a 2-cycle
penalty is encountered, during which time the buffer is updated.

One variation on the branch-target buffer is to store one or more target
instructions instead of, or in addition to, the predicted target address. This varia- 
tion has two potential advantages. First, it allows the branch-target buffer access 
to take longer than the time between successive instruction fetches, possibly 
allowing a larger branch-target buffer. Second, buffering the actual target instruc- 
tions allows us to perform an optimization called branch folding. Branch folding 
can be used to obtain 0-cycle unconditional branches, and sometimes 0-cycle 
conditional branches. Consider a branch-target buffer that buffers instructions 
from the predicted path and is being accessed with the address of an uncondi- 
tional branch. The only function of the unconditional branch is to change the PC. 
Thus, when the branch-target buffer signals a hit and indicates that the branch is 
unconditional, the pipeline can simply substitute the instruction from the branch- 
target buffer in place of the instruction that is returned from the cache (which is 
the unconditional branch). If the processor is issuing multiple instructions per 
cycle, then the buffer will need to supply multiple instructions to obtain the max- 
imum benefit. In some cases, it may be possible to eliminate the cost of a condi- 
tional branch when the condition codes are preset. 


Return Address Predictors 


As we try to increase the opportunity and accuracy of speculation we face the 
challenge of predicting indirect jumps, that is, jumps whose destination address 
varies at run time. Although high-level language programs will generate such 
jumps for indirect procedure calls, select or case statements, and FORTRAN- 
computed gotos, the vast majority of the indirect jumps come from procedure 
returns. For example, for the SPEC95 benchmarks, procedure returns account for 
more than 15% of the branches and the vast majority of the indirect jumps on 
average. For object-oriented languages like C++ and Java, procedure returns are 
even more frequent. Thus, focusing on procedure returns seems appropriate. 

Though procedure returns can be predicted with a branch-target buffer, the 
accuracy of such a prediction technique can be low if the procedure is called from 
multiple sites and the calls from one site are not clustered in time. For example, 
in SPEC CPU95, an aggressive branch predictor achieves an accuracy of less 
than 60% for such return branches. To overcome this problem, some designs use 
a small buffer of return addresses operating as a stack. This structure caches the 
most recent return addresses: pushing a return address on the stack at a call and 
popping one off at a return. If the cache is sufficiently large (i.e., as large as the 
maximum call depth), it will predict the returns perfectly. Figure 2.25 shows the 
performance of such a return buffer with 0-16 elements for a number of the 
SPEC CPU95 benchmarks. We will use a similar return predictor when we examine
the studies of ILP in Section 3.2.

Figure 2.25 Prediction accuracy for a return address buffer operated as a stack on a
number of SPEC CPU95 benchmarks. The accuracy is the fraction of return addresses
predicted correctly. A buffer of 0 entries implies that the standard branch prediction is
used. Since call depths are typically not large, with some exceptions, a modest buffer
works well. This data comes from Skadron et al. (1999), and uses a fix-up mechanism to
prevent corruption of the cached return addresses.
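The return-address buffer described in this section behaves like a small bounded stack. The sketch below is a simplified model: the class name, the default capacity, and the choice to discard the oldest entry on overflow are assumptions for illustration, and it omits the fix-up mechanism mentioned in the caption of Figure 2.25.

```python
class ReturnAddressStack:
    """Toy return-address predictor: push on call, pop on return."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.stack = []

    def on_call(self, return_address):
        """Record the address the matching return should jump to."""
        if len(self.stack) == self.capacity:
            self.stack.pop(0)  # assumed overflow policy: lose the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        """Predict the target of a return, or None if nothing is recorded."""
        return self.stack.pop() if self.stack else None


# With capacity 2 and a call depth of 3, the outermost return cannot be predicted.
ras = ReturnAddressStack(capacity=2)
for addr in (0x10, 0x20, 0x30):
    ras.on_call(addr)
print([ras.predict_return() for _ in range(3)])
```

As long as the capacity is at least the call depth, every return pops the address pushed by its own call, which is the perfect-prediction case noted in the text.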


Integrated Instruction Fetch Units 


To meet the demands of multiple-issue processors, many recent designers have 
chosen to implement an integrated instruction fetch unit, as a separate autonomous
unit that feeds instructions to the rest of the pipeline. Essentially, this
amounts to recognizing that characterizing instruction fetch as a simple single 
pipe stage given the complexities of multiple issue is no longer valid. 

Instead, recent designs have used an integrated instruction fetch unit that inte- 
grates several functions: 


1. Integrated branch prediction—The branch predictor becomes part of the 
instruction fetch unit and is constantly predicting branches, so as to drive the 
fetch pipeline. 


2. Instruction prefetch—To deliver multiple instructions per clock, the 
instruction fetch unit will likely need to fetch ahead. The unit autonomously 
manages the prefetching of instructions (see Chapter 5 for a discussion of 
techniques for doing this), integrating it with branch prediction.


3. Instruction memory access and buffering—When fetching multiple instructions
per cycle a variety of complexities are encountered, including the difficulty
that fetching multiple instructions may require accessing multiple cache
lines. The instruction fetch unit encapsulates this complexity, using prefetch 
to try to hide the cost of crossing cache blocks. The instruction fetch unit also 
provides buffering, essentially acting as an on-demand unit to provide 
instructions to the issue stage as needed and in the quantity needed. 


As designers try to increase the number of instructions executed per clock, 
instruction fetch will become an ever more significant bottleneck, and clever new 
ideas will be needed to deliver instructions at the necessary rate. One of the 
newer ideas, called trace caches and used in the Pentium 4, is discussed in 
Appendix C. 


Speculation: Implementation Issues and Extensions 


In this section we explore three issues that involve the implementation of specu- 
lation, starting with the use of register renaming, the approach that has almost 
totally replaced the use of a reorder buffer. We then discuss one important possi- 
ble extension to speculation on control flow: an idea called value prediction. 


Speculation Support: Register Renaming versus Reorder Buffers 


One alternative to the use of a reorder buffer (ROB) is the explicit use of a larger 
physical set of registers combined with register renaming. This approach builds 
on the concept of renaming used in Tomasulo's algorithm and extends it. In 
Tomasulo's algorithm, the values of the architecturally visible registers (R0, ...,
R31 and F0, ..., F31) are contained, at any point in execution, in some combination
of the register set and the reservation stations. With the addition of speculation,
register values may also temporarily reside in the ROB. In either case, if the
processor does not issue new instructions for a period of time, all existing 
instructions will commit, and the register values will appear in the register file, 
which directly corresponds to the architecturally visible registers. 

In the register-renaming approach, an extended set of physical registers is 
used to hold both the architecturally visible registers as well as temporary values. 
Thus, the extended registers replace the function of both the ROB and the reser- 
vation stations. During instruction issue, a renaming process maps the names of 
architectural registers to physical register numbers in the extended register set, 
allocating a new unused register for the destination. WAW and WAR hazards are 
avoided by renaming of the destination register, and speculation recovery is han- 
dled because a physical register holding an instruction destination does not 
become the architectural register until the instruction commits. The renaming 
map is a simple data structure that supplies the physical register number of the 
register that currently corresponds to the specified architectural register. This
table is similar in structure and function to the register status table in
Tomasulo's algorithm. When an instruction commits, the renaming table is permanently
updated to indicate that a physical register corresponds to the actual
architectural register, thus effectively finalizing the update to the processor state.

An advantage of the renaming approach versus the ROB approach is that 
instruction commit is simplified, since it requires only two simple actions: record 
that the mapping between an architectural register number and physical register 
number is no longer speculative, and free up any physical registers being used to 
hold the "older" value of the architectural register. In a design with reservation 
stations, a station is freed up when the instruction using it completes execution, 
and a ROB entry is freed up when the corresponding instruction commits. 

With register renaming, deallocating registers is more complex, since before 
we free up a physical register, we must know that it no longer corresponds to an 
architectural register, and that no further uses of the physical register are out- 
standing. A physical register corresponds to an architectural register until the 
architectural register is rewritten, causing the renaming table to point elsewhere. 
That is, if no renaming entry points to a particular physical register, then it no 
longer corresponds to an architectural register. There may, however, still be uses 
of the physical register outstanding. The processor can determine whether this is 
the case by examining the source register specifiers of all instructions in the func- 
tional unit queues. If a given physical register does not appear as a source and it is 
not designated as an architectural register, it may be reclaimed and reallocated. 

Alternatively, the processor can simply wait until another instruction that 
writes the same architectural register commits. At that point, there can be no fur- 
ther uses of the older value outstanding. Although this method may tie up a phys- 
ical register slightly longer than necessary, it is easy to implement and hence is 
used in several recent superscalars. 
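The renaming map and the simpler of the two reclamation policies can be sketched together. This is a minimal model under stated assumptions: the class name, the register counts, and the FIFO free list are illustrative, and the sketch tracks only the current mapping, so it does not model recovery from misspeculation.

```python
from collections import deque


class RenameMap:
    """Toy register renamer with a free list of physical registers."""

    def __init__(self, num_arch=4, num_phys=8):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dest, sources):
        """Rename one instruction at issue.

        Returns (new_phys_dest, phys_sources, old_phys_dest). The old mapping
        is kept by the caller so it can be released when this instruction commits.
        """
        # Sources are looked up before the destination is remapped, so an
        # instruction that reads and writes the same register sees the old value.
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        old_phys = self.map[dest]
        new_phys = self.free.popleft()
        self.map[dest] = new_phys
        return new_phys, phys_sources, old_phys

    def commit(self, old_phys_dest):
        """On commit, free the register that held the destination's older value."""
        self.free.append(old_phys_dest)


rm = RenameMap()
first = rm.rename(dest=1, sources=[1, 2])   # e.g., R1 <- R1 op R2
second = rm.rename(dest=1, sources=[1, 3])  # a later write to R1 reads the new copy
print(first, second)
rm.commit(first[2])  # the first writer commits: R1's original register is freed
```

Because each write to R1 receives its own physical register, the two instructions have no WAW or WAR conflict, and the older copy is released only once a later writer of the same architectural register has committed.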

One question you may be asking is, How do we ever know which registers are 
the architectural registers if they are constantly changing? Most of the time when 
the program is executing it does not matter. There are clearly cases, however, 
where another process, such as the operating system, must be able to know 
exactly where the contents of a certain architectural register reside. To understand 
how this capability is provided, assume the processor does not issue instructions 
for some period of time. Eventually all instructions in the pipeline will commit, 
and the mapping between the architecturally visible registers and physical regis- 
ters will become stable. At that point, a subset of the physical registers contains 
the architecturally visible registers, and the value of any physical register not 
associated with an architectural register is unneeded. It is then easy to move the 
architectural registers to a fixed subset of physical registers so that the values can 
be communicated to another process. 

Within the past few years most high-end superscalar processors, including the 
Pentium series, the MIPS R12000, and the Power and PowerPC processors, have 
chosen to use register renaming, adding from 20 to 80 extra registers. Since all 
results are allocated a new virtual register until they commit, these extra registers 
replace a primary function of the ROB and largely determine how many instruc- 
tions may be in execution (between issue and commit) at one time. 
How Much to Speculate 


One of the significant advantages of speculation is its ability to uncover events 
that would otherwise stall the pipeline early, such as cache misses. This potential 
advantage, however, comes with a significant potential disadvantage. Speculation 
is not free: it takes time and energy, and the recovery of incorrect speculation fur- 
ther reduces performance. In addition, to support the higher instruction execution 
rate needed to benefit from speculation, the processor must have additional 
resources, which take silicon area and power. Finally, if speculation causes an 
exceptional event to occur, such as a cache or TLB miss, the potential for signifi- 
cant performance loss increases, if that event would not have occurred without 
speculation. 

To maintain most of the advantage, while minimizing the disadvantages, most 
pipelines with speculation will allow only low-cost exceptional events (such as a 
first-level cache miss) to be handled in speculative mode. If an expensive excep- 
tional event occurs, such as a second-level cache miss or a translation lookaside 
buffer (TLB) miss, the processor will wait until the instruction causing the event 
is no longer speculative before handling the event. Although this may slightly 
degrade the performance of some programs, it avoids significant performance 
losses in others, especially those that suffer from a high frequency of such events 
coupled with less-than-excellent branch prediction. 

In the 1990s, the potential downsides of speculation were less obvious. As 
processors have evolved, the real costs of speculation have become more appar- 
ent, and the limitations of wider issue and speculation have been obvious. We 
return to this issue in the next chapter. 


Speculating through Multiple Branches 


In the examples we have considered in this chapter, it has been possible to resolve 
a branch before having to speculate on another. Three different situations can 
benefit from speculating on multiple branches simultaneously: a very high branch 
frequency, significant clustering of branches, and long delays in functional units. 
In the first two cases, achieving high performance may mean that multiple 
branches are speculated, and it may even mean handling more than one branch 
per clock. Database programs, and other less structured integer computations, 
often exhibit these properties, making speculation on multiple branches impor- 
tant. Likewise, long delays in functional units can raise the importance of specu- 
lating on multiple branches as a way to avoid stalls from the longer pipeline 
delays. 

Speculating on multiple branches slightly complicates the process of specula- 
tion recovery, but is straightforward otherwise. A more complex technique is 
predicting and speculating on more than one branch per cycle. The IBM Power2 
could resolve two branches per cycle but did not speculate on any other instruc- 
tions. As of 2005, no processor has yet combined full speculation with resolving 
multiple branches per cycle. 
Value Prediction 


One technique for increasing the amount of ILP available in a program is value 
prediction. Value prediction attempts to predict the value that will be produced by 
an instruction. Obviously, since most instructions produce a different value every 
time they are executed (or at least a different value from a set of values), value 
prediction can have only limited success. There are, however, certain instructions 
for which it is easier to predict the resulting value—for example, loads that load 
from a constant pool, or that load a value that changes infrequently. In addition, 
when an instruction produces a value chosen from a small set of potential values, 
it may be possible to predict the resulting value by correlating it with other 
program behavior. 

Value prediction is useful if it significantly increases the amount of available 
ILP. This possibility is most likely when a value is used as the source of a chain 
of dependent computations, such as a load. Because value prediction is used to 
enhance speculations and incorrect speculation has detrimental performance 
impact, the accuracy of the prediction is critical. 

Much of the focus of research on value prediction has been on loads. We can 
estimate the maximum accuracy of a load value predictor by examining how 
often a load returns a value that matches a value returned in a recent execution of 
the load. The simplest case to examine is when the load returns a value that 
matches the value on the last execution of the load. For a range of SPEC 
CPU2000 benchmarks, this redundancy occurs from less than 5% of the time to 
almost 80% of the time. If we allow the load to match any of the most recent 16 
values returned, the frequency of a potential match increases, and many bench- 
marks show an 80% match rate. Of course, matching 1 of 16 recent values does 
not tell you what value to predict, but it does mean that even with additional 
information it is impossible for prediction accuracy to exceed 80%. 
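
The redundancy measurement described above can be sketched as follows. The trace format, a sequence of `(load_pc, value)` pairs, and the function name `value_locality` are assumptions made for illustration; the result is an upper bound on accuracy for a predictor limited to recent values, not the accuracy of any real predictor.

```python
from collections import defaultdict, deque


def value_locality(trace, history_depth):
    """Fraction of dynamic loads whose value matches one of the last
    `history_depth` values returned by the same static load.

    `trace` is an iterable of (load_pc, value) pairs (hypothetical format).
    """
    history = defaultdict(lambda: deque(maxlen=history_depth))
    matches = 0
    total = 0
    for pc, value in trace:
        recent = history[pc]
        if value in recent:
            matches += 1
        recent.append(value)  # the oldest value falls out once the deque is full
        total += 1
    return matches / total if total else 0.0


# A tiny made-up trace: two static loads at addresses 0x40 and 0x44.
trace = [(0x40, 7), (0x40, 7), (0x40, 9), (0x40, 7), (0x44, 1), (0x44, 1)]
last_value = value_locality(trace, history_depth=1)   # match against last value only
last_16 = value_locality(trace, history_depth=16)     # match against any of last 16
```

On this toy trace the deeper history can only raise the match rate, mirroring the observation that matching any of 16 recent values is more frequent than matching the last value alone.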

Because of the high costs of misprediction and the likely case that mispredic- 
tion rates will be significant (20% to 50%), researchers have focused on assessing 
which loads are more predictable and only attempting to predict those. This leads 
to a lower misprediction rate, but also fewer candidates for accelerating through 
prediction. In the limit, if we attempt to predict only those loads that always 
return the same value, it is likely that only 10% to 15% of the loads can be pre- 
dicted. Research on value prediction continues. The results to date, however, have 
not been sufficiently compelling that any commercial processor has included the 
capability. 

One simple idea that has been adopted and is related to value prediction is 
address aliasing prediction. Address aliasing prediction is a simple technique that 
predicts whether two stores or a load and a store refer to the same memory 
address. If two such references do not refer to the same address, then they may be 
safely interchanged. Otherwise, we must wait until the memory addresses 
accessed by the instructions are known. Because we need not actually predict the 
address values, only whether such values conflict, the prediction is both more sta- 
ble and simpler. Hence, this limited form of address value speculation has been 
used by a few processors. 
Putting It All Together: The Intel Pentium 4 


The Pentium 4 is a processor with a deep pipeline supporting multiple issue with 
speculation. In this section, we describe the highlights of the Pentium 4 microar- 
chitecture and examine its performance for the SPEC CPU benchmarks. The 
Pentium 4 also supports multithreading, a topic we discuss in the next chapter. 

The Pentium 4 uses an aggressive out-of-order speculative microarchitecture, 
called Netburst, that is deeply pipelined with the goal of achieving high instruc- 
tion throughput by combining multiple issue and high clock rates. Like the mi- 
croarchitecture used in the Pentium III, a front-end decoder translates each IA-32 
instruction to a series of micro-operations (uops), which are similar to typical 
RISC instructions. The uops are then executed by a dynamically scheduled spec- 
ulative pipeline. 

The Pentium 4 uses a novel execution trace cache to generate the uop instruc- 
tion stream, as opposed to a conventional instruction cache that would hold IA-32 
instructions. A trace cache is a type of instruction cache that holds sequences of 
instructions to be executed including nonadjacent instructions separated by 
branches; a trace cache tries to exploit the temporal sequencing of instruction ex- 
ecution rather than the spatial locality exploited in a normal cache; trace caches 
are explained in detail in Appendix C. 

The Pentium 4's execution trace cache is a trace cache of uops, corresponding 
to the decoded IA-32 instruction stream. By filling the pipeline from the execu- 
tion trace cache, the Pentium 4 avoids the need to redecode IA-32 instructions 
whenever the trace cache hits. Only on trace cache misses are IA-32 instructions 
fetched from the L2 cache and decoded to refill the execution trace cache. Up to 
three IA-32 instructions may be decoded and translated every cycle, generating 
up to six uops; when a single IA-32 instruction requires more than three uops, the 
uop sequence is generated from the microcode ROM. 

The execution trace cache has its own branch target buffer, which predicts the 
outcome of uop branches. The high hit rate in the execution trace cache (for ex- 
ample, the trace cache miss rate for the SPEC CPUINT2000 benchmarks is less 
than 0.15%) means that IA-32 instruction fetch and decode are rarely needed. 

After fetching from the execution trace cache, the uops are executed by an 
out-of-order speculative pipeline, similar to that in Section 2.6, but using register 
renaming rather than a reorder buffer. Up to three uops per clock can be renamed 
and dispatched to the functional unit queues, and three uops can be committed 
each clock cycle. There are four dispatch ports, which allow a total of six uops to 
be dispatched to the functional units every clock cycle. The load and store units 
each have their own dispatch port, another port covers basic ALU operations, and 
a fourth handles FP and integer operations. Figure 2.26 shows a diagram of the 
microarchitecture. 

Since the Pentium 4 microarchitecture is dynamically scheduled, uops do not 
follow a simple static set of pipeline stages during their execution. Instead vari- 
ous stages of execution (instruction fetch, decode, uop issue, rename, schedule, 
execute, and retire) can take varying numbers of clock cycles. In the Pentium III, 
the minimum time for an instruction to go from fetch to retire was 11 clock 
cycles, with instructions requiring multiple clock cycles in the execution stage 
taking longer. As in any dynamically scheduled pipeline, instructions could take 
much longer if they had to wait for operands. As stated earlier, the Pentium 4 
introduced a much deeper pipeline, partitioning stages of the Pentium III pipeline 
so as to achieve a higher clock rate. In the initial Pentium 4 introduced in 2000, 
the minimum number of cycles to transit the pipeline was increased to 21, allow- 
ing for a 1.5 GHz clock rate. In 2004, Intel introduced a version of the Pentium 4 
with a 3.2 GHz clock rate. To achieve this high clock rate, further pipelining was 
added so that a simple instruction takes 31 clock cycles to go from fetch to retire. 
This additional pipelining, together with improvements in transistor speed, 
allowed the clock rate to more than double over the first Pentium 4. 

Figure 2.26 The Pentium 4 microarchitecture. The cache sizes represent the Pentium 4 640. Note that the instruc- 
tions are usually coming from the trace cache; only when the trace cache misses is the front-end instruction prefetch 
unit consulted. This figure was adapted from Boggs et al. [2004]. 
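
The pipeline depths and clock rates quoted for the two Pentium 4 versions can be turned into minimum fetch-to-retire times. The sketch below is back-of-the-envelope arithmetic only; the function name is an assumption, and it ignores every source of delay beyond the minimum cycle count.

```python
def min_pipeline_latency_ns(min_cycles, clock_ghz):
    """Minimum fetch-to-retire time in nanoseconds (cycles / GHz = ns)."""
    return min_cycles / clock_ghz


first_p4 = min_pipeline_latency_ns(21, 1.5)  # 2000 Pentium 4: 14.0 ns
later_p4 = min_pipeline_latency_ns(31, 3.2)  # 2004 Pentium 4: about 9.7 ns

clock_gain = 3.2 / 1.5               # about 2.13x higher clock rate
latency_gain = first_p4 / later_p4   # about 1.45x shorter minimum latency
```

Under these assumptions the clock rate more than doubles, while the minimum time for a single simple instruction to traverse the pipeline shrinks by less than a factor of 1.5, because the instruction now passes through 31 stages rather than 21.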

Obviously, with such deep pipelines and aggressive clock rates the costs of 
cache misses and branch mispredictions are both very high. A two-level cache is 
used to minimize the frequency of DRAM accesses. Branch prediction is done 
with a branch-target buffer using a two-level predictor with both local and global 
branch histories; in the most recent Pentium 4, the size of the branch-target buffer 
was increased, and the static predictor, used when the branch-target buffer 
misses, was improved. Figure 2.27 summarizes key features of the microarchitec- 
ture, and the caption notes some of the changes since the first version of the Pen- 
tium 4 in 2000. 
Feature                     Size                         Comments

Front-end branch-target     4K entries                   Predicts the next IA-32 instruction to fetch; used
buffer                                                   only when the execution trace cache misses.

Execution trace cache       12K uops                     Trace cache used for uops.

Trace cache branch-target   2K entries                   Predicts the next uop.
buffer

Registers for renaming      128 total                    128 uops can be in execution with up to 48 loads
                                                         and 32 stores.

Functional units            7 total: 2 simple ALU,       The simple ALU units run at twice the clock rate,
                            complex ALU, load, store,    accepting up to two simple ALU uops every clock
                            FP move, FP arithmetic       cycle. This allows execution of two dependent ALU
                                                         operations in a single clock cycle.

L1 data cache               16 KB; 8-way associative;    Integer load to use latency is 4 cycles; FP load to
                            64-byte blocks;              use latency is 12 cycles; up to 8 outstanding load
                            write through                misses.

L2 cache                    2 MB; 8-way associative;     256 bits to L1, providing 108 GB/sec; 18-cycle
                            128-byte blocks;             access time; 64 bits to memory capable of 6.4
                            write back                   GB/sec. A miss in L2 does not cause an automatic
                                                         update of L1.






Figure 2.27 Important characteristics of the recent Pentium 4 640 implementation in 90 nm technology (code 
named Prescott). The newer Pentium 4 uses larger caches and branch-prediction buffers, allows more loads and 
stores outstanding, and has higher bandwidth between levels in the memory system. Note the novel use of double- 
speed ALUs, which allow the execution of back-to-back dependent ALU operations in a single clock cycle; having 
twice as many ALUs, an alternative design point, would not allow this capability. The original Pentium 4 used a trace 
cache BTB with 512 entries, an L1 cache of 8 KB, and an L2 cache of 256 KB. 


An Analysis of the Performance of the Pentium 4 


The deep pipeline of the Pentium 4 makes the use of speculation, and its depen- 
dence on branch prediction, critical to achieving high performance. Likewise, 
performance is very dependent on the memory system. Although dynamic sched- 
uling and the large number of outstanding loads and stores support hiding the 
latency of cache misses, the aggressive 3.2 GHz clock rate means that L2 misses 
are likely to cause a stall as the queues fill up while awaiting the completion of 
the miss. 

Because of the importance of branch prediction and cache misses, we focus 
our attention on these two areas. The charts in this section use five of the integer 
SPEC CPU2000 benchmarks and five of the FP benchmarks, and the data is cap- 
tured using counters within the Pentium 4 designed for performance monitoring. 
The processor is a Pentium 4 640 running at 3.2 GHz with an 800 MHz system 
bus and 667 MHz DDR2 DRAMs for main memory. 

Figure 2.28 shows the branch-misprediction rate in terms of mispredictions 
per 1000 instructions. Remember that in terms of pipeline performance, what 
matters is the number of mispredictions per instruction; the FP benchmarks gen- 
erally have fewer branches per instruction (48 branches per 1000 instructions) 
versus the integer benchmarks (186 branches per 1000 instructions), as well as 
better prediction rates (98% versus 94%). The result, as Figure 2.28 shows, is that 
the misprediction rate per instruction for the integer benchmarks is more than 8 
times higher than the rate for the FP benchmarks. 

Figure 2.28 Branch misprediction rate per 1000 instructions for five integer and 
five floating-point benchmarks from the SPEC CPU2000 benchmark suite. This data 
and the rest of the data in this section were acquired by John Holm and Dileep Bhan- 
darkar of Intel. 
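
The relationship between branch frequency, prediction accuracy, and mispredictions per 1000 instructions can be checked with the averages quoted in the text. The function name is an assumption, and the inputs are the rounded suite averages, so the per-benchmark bars in the figure need not match these values exactly.

```python
def mispredictions_per_1000(branches_per_1000, prediction_accuracy):
    """Mispredictions per 1000 instructions = branch frequency x misprediction rate."""
    return branches_per_1000 * (1.0 - prediction_accuracy)


fp_rate = mispredictions_per_1000(48, 0.98)     # about 0.96
int_rate = mispredictions_per_1000(186, 0.94)   # about 11.2
ratio = int_rate / fp_rate                      # about 11.6
```

With these rounded averages the integer benchmarks come out roughly 11 to 12 times worse per instruction, which is consistent with the "more than 8 times" statement above.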

Branch-prediction accuracy is crucial in speculative processors, since incor- 
rect speculation requires recovery time and wastes energy pursuing the wrong 
path. Figure 2.29 shows the fraction of executed uops that are the result of mis- 
speculation. As we would suspect, the misspeculation rate results look almost 
identical to the misprediction rates. 

How do the cache miss rates contribute to possible performance losses? The 
trace cache miss rate is almost negligible for this set of the SPEC benchmarks, 
with only one benchmark (186.crafty) showing any significant misses (0.6%). The 
L1 and L2 miss rates are more significant. Figure 2.30 shows the L1 and L2 miss 
rates for these 10 benchmarks. Although the miss rate for L1 is about 14 times 
higher than the miss rate for L2, the miss penalty for L2 is comparably higher, 
and the inability of the microarchitecture to hide these very long misses means 
that L2 misses likely are responsible for an equal or greater performance loss 
than L1 misses, especially for benchmarks such as mcf and swim. 
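
A rough way to see why L2 misses can matter as much as L1 misses is to weigh miss frequency against miss penalty. The sketch below assumes no overlap of miss latency with execution; the factor of 14 comes from the text, the factor of 10 from the Figure 2.30 caption, and the factor of 20 is a hypothetical value added for illustration.

```python
def l1_to_l2_stall_ratio(l1_over_l2_miss_rate, l2_over_l1_penalty):
    """Stall time from L1 misses divided by stall time from L2 misses,
    assuming stall time = miss count x miss penalty with no overlap."""
    return l1_over_l2_miss_rate / l2_over_l1_penalty


at_10x = l1_to_l2_stall_ratio(14, 10)  # 1.4: L1 misses cost somewhat more
at_14x = l1_to_l2_stall_ratio(14, 14)  # 1.0: equal contribution
at_20x = l1_to_l2_stall_ratio(14, 20)  # 0.7: L2 misses dominate (hypothetical penalty)
```

Because short L1 misses are more easily hidden by out-of-order execution than long L2 misses, the real balance probably tilts further toward L2 than this simple product suggests.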

How do the effects of misspeculation and cache misses translate to actual per- 
formance? Figure 2.31 shows the effective CPI for the 10 SPEC CPU2000 
benchmarks. There are three benchmarks whose performance stands out from the 
pack and are worth examining: 
Figure 2.29 The percentage of uop instructions issued that are misspeculated. 

Figure 2.30 L1 data cache and L2 cache misses per 1000 instructions for 10 SPEC CPU2000 benchmarks. Note 
that the scale of the L1 misses is 10 times that of the L2 misses. Because the miss penalty for L2 is likely to be at least 
10 times larger than for L1, the relative sizes of the bars are an indication of the relative performance penalty for the 
misses in each cache. The inability to hide long L2 misses with overlapping execution will further increase the stalls 
caused by L2 misses relative to L1 misses. 

Figure 2.31 The CPI for the 10 SPEC CPU benchmarks. An increase in the CPI by a fac- 
tor of 1.29 comes from the translation of IA-32 instructions into uops, which results in 
1.29 uops per IA-32 instruction on average for these 10 benchmarks. 


1. mcf has a CPI that is more than four times higher than that of the four other 
integer benchmarks. It has the worst misspeculation rate. Equally impor- 
tantly, mcf has the worst L1 and the worst L2 miss rate of any bench- 
mark, integer or floating point, in the SPEC suite. The high cache miss rates 
make it impossible for the processor to hide significant amounts of miss 
latency. 


2. vpr achieves a CPI that is 1.6 times higher than three of the five integer 
benchmarks (excluding mcf). This appears to arise from a branch mispredic- 
tion rate that is the worst among the integer benchmarks (although not much 
worse than the average) together with a high L2 miss rate, second only to mcf 
among the integer benchmarks. 


3. swim is the lowest performing FP benchmark, with a CPI that is more than 
two times the average of the other four FP benchmarks. swim's problems are 
high L1 and L2 cache miss rates, second only to mcf. Notice that swim has 
excellent speculation results, but that success probably cannot hide the high 
miss rates, especially in L2. In contrast, several benchmarks with reasonable 
L1 miss rates and low L2 miss rates (such as mgrid and gzip) perform well. 


To close this section, let's look at the relative performance of the Pentium 4 and 
AMD Opteron for this subset of the SPEC benchmarks. The AMD Opteron and 
Intel Pentium 4 share a number of similarities: 
■ Both use a dynamically scheduled, speculative pipeline capable of issuing 
and committing three IA-32 instructions per clock. 


■ Both use a two-level on-chip cache structure, although the Pentium 4 uses a 
trace cache for the first-level instruction cache and recent Pentium 4 imple- 
mentations have larger second-level caches. 


■ They have similar transistor counts, die size, and power, with the Pentium 4 
being about 7% to 10% higher on all three measures at the highest clock rates 
available in 2005 for these two processors. 


The most significant difference is the very deep pipeline of the Intel Netburst 
microarchitecture, which was designed to allow higher clock rates. Although com- 
pilers optimized for the two architectures produce slightly different code 
sequences, comparing CPI measures can provide important insights into how 
these two processors compare. Remember that differences in the memory hierar- 
chy as well as differences in the pipeline structure will affect these measurements; 
we analyze the differences in memory system performance in Chapter 5. Figure 
2.32 shows the CPI measures for a set of SPEC CPU2000 benchmarks for a 3.2 
GHz Pentium 4 and a 2.6 GHz AMD Opteron. At these clock rates, the Opteron 
processor has a CPI that is lower, on average, by a factor of 1.27 than that of the Pentium 4. 

Of course, we should expect the Pentium 4, with its much deeper pipeline, to 
have a somewhat higher CPI than the AMD Opteron. The key question for the 
very deeply pipelined Netburst design is whether the increase in clock rate, 
which the deeper pipelining allows, overcomes the disadvantages of a higher 
CPI. We examine this by showing the SPEC CPU2000 performance for these two 
processors at the highest clock rates available for them in 2005: 2.8 
GHz for the Opteron and 3.8 GHz for the Pentium 4. These higher clock rates 
will increase the effective CPI measurement versus those in Figure 2.32, since the 
cost of a cache miss will increase. Figure 2.33 shows the relative performance on 
the same subset of SPEC as Figure 2.32. The Opteron is slightly faster, meaning 
that the higher clock rate of the Pentium 4 is insufficient to overcome the higher 
CPI arising from more pipeline stalls. 

Figure 2.32 A 2.6 GHz AMD Opteron has a lower CPI by a factor of 1.27 versus a 3.2 
GHz Pentium 4. 

Figure 2.33 The performance of a 2.8 GHz AMD Opteron versus a 3.8 GHz Intel Pen- 
tium 4 shows a performance advantage for the Opteron of about 1.08. 
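
The clock rates and ratios quoted in this comparison can be tied together through the CPU time equation (time = instruction count × CPI / clock rate). The sketch below assumes equal instruction counts on both processors, which the text notes is only approximately true because the compilers generate slightly different code; the function names are illustrative.

```python
def relative_performance(clock_a_ghz, clock_b_ghz, cpi_b_over_cpi_a):
    """Performance of A relative to B, assuming equal instruction counts."""
    return (clock_a_ghz / clock_b_ghz) * cpi_b_over_cpi_a


def implied_cpi_ratio(perf_a_over_b, clock_a_ghz, clock_b_ghz):
    """CPI of B divided by CPI of A implied by a measured speedup of A over B."""
    return perf_a_over_b * clock_b_ghz / clock_a_ghz


# 2.6 GHz Opteron vs 3.2 GHz Pentium 4 with the measured CPI ratio of 1.27.
opteron_vs_p4 = relative_performance(2.6, 3.2, 1.27)   # about 1.03

# 2.8 GHz Opteron vs 3.8 GHz Pentium 4 with the measured speedup of 1.08.
cpi_ratio_top_bins = implied_cpi_ratio(1.08, 2.8, 3.8)  # about 1.47
```

Under the equal-instruction-count assumption, the implied CPI gap at the top clock rates (about 1.47) is larger than the 1.27 measured at the lower clock rates, which is the direction expected when cache misses cost more cycles at higher frequencies.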

Hence, while the Pentium 4 performs well, it is clear that the attempt to 
achieve both high clock rates via a deep pipeline and high instruction throughput 
via multiple issue is not as successful as the designers once believed it would be. 
We discuss this topic in depth in the next chapter. 


Fallacies and Pitfalls 


Our first fallacy has two parts: First, simple rules do not hold, and, second, the 
choice of benchmarks plays a major role. 


Fallacy: Processors with lower CPIs will always be faster. 
Fallacy: Processors with faster clock rates will always be faster. 


Although a lower CPI is certainly better, sophisticated multiple-issue pipelines 
typically have slower clock rates than processors with simple pipelines. In appli- 
cations with limited ILP or where the parallelism cannot be exploited by the 
hardware resources, the faster clock rate often wins. But, when significant ILP 
exists, a processor that exploits lots of ILP may be better. 

The IBM Power5 processor is designed for high-performance integer and FP; 
it contains two processor cores each capable of sustaining four instructions per 
clock, including two FP and two load-store instructions. The highest clock rate 
for a Power5 processor in 2005 is 1.9 GHz. In comparison, the Pentium 4 offers a 
single processor with multithreading (see the next chapter). The processor can 
sustain three instructions per clock with a very deep pipeline, and the maximum 
available clock rate in 2005 is 3.8 GHz. 

Thus, the Power5 will be faster if the product of the instruction count and CPI 
is less than one-half the same product for the Pentium 4. As Figure 2.34 shows, 
the CPI × instruction count advantages of the Power5 are significant for the FP 
programs, sometimes by more than a factor of 2, while for the integer programs 
the CPI × instruction count advantage of the Power5 is usually not enough to 
overcome the clock rate advantage of the Pentium 4. By comparing the SPEC 
numbers, we find that the Power5's advantage in the product of instruction count 
and CPI is a factor of 3.1 on the floating-point programs but only 1.5 on the inte- 
ger programs. Because the maximum clock rate of the Pentium 4 in 2005 is 
exactly twice that of the Power5, the Power5 is faster by a factor of 1.5 on 
SPECfp2000 and the Pentium 4 is faster by a factor of 1.3 on SPECint2000. 
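
The arithmetic behind these two factors follows directly from the CPU time equation. In the sketch below the function name and argument names are assumptions; the clock rates and the 3.1 and 1.5 advantages are the figures quoted in the text.

```python
def power5_speedup(ic_cpi_advantage, clock_power5_ghz=1.9, clock_p4_ghz=3.8):
    """Power5 performance relative to the Pentium 4.

    ic_cpi_advantage = (IC x CPI on the Pentium 4) / (IC x CPI on the Power5).
    Time = IC x CPI / clock rate, so speedup = advantage x clock ratio.
    """
    return ic_cpi_advantage * clock_power5_ghz / clock_p4_ghz


fp_speedup = power5_speedup(3.1)     # about 1.55: Power5 faster on SPECfp2000
int_speedup = power5_speedup(1.5)    # about 0.75: Pentium 4 faster on SPECint2000
p4_int_advantage = 1 / int_speedup   # about 1.33
```

The unrounded results, about 1.55 and 1.33, match the factors of 1.5 and 1.3 given above, and the break-even point is an advantage of exactly 2 in instruction count × CPI.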
Figure 2.34 A comparison of the 1.9 GHz IBM Power5 processor versus the 3.8 GHz 
Intel Pentium 4 for 20 SPEC benchmarks (10 integer on the left and 10 floating 
point on the right) shows that the higher clock Pentium 4 is generally faster for the 
integer workload, while the lower CPI Power5 is usually faster for the floating- 
point workload. 
Pitfall: Sometimes bigger and dumber is better. 


Advanced pipelines have focused on novel and increasingly sophisticated 
schemes for improving CPI. The 21264 uses a sophisticated tournament predictor 
with a total of 29K bits (see page 88), while the earlier 21164 uses a simple 2-bit 
predictor with 2K entries (or a total of 4K bits). For the SPEC95 benchmarks, the 
more sophisticated branch predictor of the 21264 outperforms the simpler 2-bit 
scheme on all but one benchmark. On average, for SPECint95, the 21264 has 
11.5 mispredictions per 1000 instructions committed, while the 21164 has about 
16.5 mispredictions. 

Somewhat surprisingly, the simpler 2-bit scheme works better for the 
transaction-processing workload than the sophisticated 21264 scheme (17 
mispredictions versus 19 per 1000 completed instructions)! How can a predictor 
with less than 1/7 the number of bits and a much simpler scheme actually work 
better? The answer lies in the structure of the workload. The transaction- 
processing workload has a very large code size (more than an order of magnitude 
larger than any SPEC95 benchmark) with a large branch frequency. The ability of 
the 21164 predictor to hold twice as many branch predictions based on purely 
local behavior (2K versus the 1K local predictor in the 21264) seems to provide a 
slight advantage. 
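
The storage comparison between the two predictors is simple arithmetic on the sizes quoted above. The sketch assumes K means 1024 and uses illustrative variable names.

```python
K = 1024

bits_21164 = 2 * K * 2   # 2K entries x 2 bits per entry = 4K bits
bits_21264 = 29 * K      # tournament predictor: 29K bits in total

size_fraction = bits_21164 / bits_21264   # about 0.138, just under 1/7

local_entries_21164 = 2 * K
local_entries_21264 = 1 * K
local_entry_ratio = local_entries_21164 / local_entries_21264   # 2.0
```

So the simpler predictor uses less than one-seventh of the storage yet tracks twice as many branches with purely local state, which is the property the text credits for its slight edge on the large transaction-processing code.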

This pitfall also reminds us that different applications can produce different 
behaviors. As processors become more sophisticated, including specific microar- 
chitectural features aimed at some particular program behavior, it is likely that 
different applications will see more divergent behavior. 


Concluding Remarks 


The tremendous interest in multiple-issue organizations came about because of 
an interest in improving performance without affecting the standard uniprocessor 
programming model. Although taking advantage of ILP is conceptually simple, 
the design problems are amazingly complex in practice. It is extremely difficult 
to achieve the performance you might expect from a simple first-level analysis. 

Rather than embracing dramatic new approaches in microarchitecture, most 
of the last 10 years have focused on raising the clock rates of multiple-issue pro- 
cessors and narrowing the gap between peak and sustained performance. The 
dynamically scheduled, multiple-issue processors announced in the last five years 
(the Pentium 4, IBM Power5, and the AMD Athlon and Opteron) have the same 
basic structure and similar sustained issue rates (three to four instructions per 
clock) as the first dynamically scheduled, multiple-issue processors announced in 
1995! But the clock rates are 10-20 times higher, the caches are 4-8 times bigger, 
there are 2-4 times as many renaming registers, and twice as many load-store 
units! The result is performance that is 8-16 times higher. 
The trade-offs between increasing clock speed and decreasing CPI through 
multiple issue are extremely hard to quantify. In the 1995 edition of this book, we 
stated: 


Although you might expect that it is possible to build an advanced multiple-issue 
processor with a high clock rate, a factor of 1.5 to 2 in clock rate has consistently 
separated the highest clock rate processors and the most sophisticated multiple- 
issue processors. It is simply too early to tell whether this difference is due to 
fundamental implementation trade-offs, or to the difficulty of dealing with the 
complexities in multiple-issue processors, or simply a lack of experience in 
implementing such processors. 


Given the availability of the Pentium 4 at 3.8 GHz, it has become clear that 
the limitation was primarily our understanding of how to build such processors. 
As we will see in the next chapter, however, it appears unclear that the initial suc- 
cess in achieving high-clock-rate processors that issue three to four instructions 
per clock can be carried much further due to limitations in available ILP, effi- 
ciency in exploiting that ILP, and power concerns. In addition, as we saw in the 
comparison of the Opteron and Pentium 4, it appears that the performance advan- 
tage in high clock rates achieved by very deep pipelines (20-30 stages) is largely 
lost by additional pipeline stalls. We analyze this behavior further in the next 
chapter. 

One insight that was clear in 1995 and has become even more obvious in 
2005 is that the peak-to-sustained performance ratios for multiple-issue proces- 
sors are often quite large and typically grow as the issue rate grows. The lessons 
to be gleaned by comparing the Power5 and Pentium 4, or the Pentium 4 and 
Pentium III (which differ primarily in pipeline depth and hence clock rate, rather 
than issue rates), remind us that it is difficult to generalize about clock rate versus 
CPI, or about the underlying trade-offs in pipeline depth, issue rate, and other 
characteristics. 

A change in approach is clearly upon us. Higher-clock-rate versions of the 
Pentium 4 have been abandoned. IBM has shifted to putting two processors on a 
single chip in the Power4 and Power5 series, and both Intel and AMD have deliv- 
ered early versions of two-processor chips. We will return to this topic in the next 
chapter and indicate why the 20-year rapid pursuit of ILP seems to have reached 
its end. 


Historical Perspective and References 


Section K.4 on the companion CD features a discussion on the development of 
pipelining and instruction-level parallelism. We provide numerous references for 
further reading and exploration of these topics. 
Case Studies with Exercises by Robert P. Colwell 


Case Study 1: Exploring the Impact of Microarchitectural 
Techniques 


Concepts illustrated by this case study 


■ Basic Instruction Scheduling, Reordering, Dispatch 
■ Multiple Issue and Hazards 
■ Register Renaming 
■ Out-of-Order and Speculative Execution 
■ Where to Spend Out-of-Order Resources 


You are tasked with designing a new processor microarchitecture, and you are 
trying to figure out how best to allocate your hardware resources. Which of the 
hardware and software techniques you learned in Chapter 2 should you apply? 
You have a list of latencies for the functional units and for memory, as well as 
some representative code. Your boss has been somewhat vague about the perfor- 
mance requirements of your new design, but you know from experience that, all 
else being equal, faster is usually better. Start with the basics. Figure 2.35 pro- 
vides a sequence of instructions and list of latencies. 


[10] <1.8, 2.1, 2.2> What would be the baseline performance (in cycles, per loop 
iteration) of the code sequence in Figure 2.35 if no new instruction execution 
could be initiated until the previous instruction execution had completed? Ignore 
front-end fetch and decode. Assume for now that execution does not stall for lack 
of the next instruction, but only one instruction/cycle can be issued. Assume the 
branch is taken, and that there is a 1 cycle branch delay slot. 


                                     Latencies beyond single cycle

Loop: LD     F2,0(Rx)                Memory LD            +3
I0:   MULTD  F2,F0,F2                Memory SD            +1
I1:   DIVD   F8,F2,F0                Integer ADD, SUB     +0
I2:   LD     F4,0(Ry)                Branches             +1
I3:   ADDD   F4,F0,F4                ADDD                 +2
I4:   ADDD   F10,F8,F2               MULTD                +4
I5:   SD     F4,0(Ry)                DIVD                 +10
I6:   ADDI   Rx,Rx,#8
I7:   ADDI   Ry,Ry,#8
I8:   SUB    R20,R4,Rx
I9:   BNZ    R20,Loop

Figure 2.35 Code and latencies for Exercises 2.1 through 2.6. 
2.2 [10] <1.8, 2.1, 2.2> Think about what latency numbers really mean: they indicate
the number of cycles a given function requires to produce its output, nothing more.
If the overall pipeline stalls for the latency cycles of each functional unit, then you
are at least guaranteed that any pair of back-to-back instructions (a "producer"
followed by a "consumer") will execute correctly. But not all instruction pairs have a
producer/consumer relationship. Sometimes two adjacent instructions have nothing
to do with each other. How many cycles would the loop body in the code sequence
in Figure 2.35 require if the pipeline detected true data dependences and only
stalled on those, rather than blindly stalling everything just because one functional
unit is busy? Show the code with <stall> inserted where necessary to accommodate
stated latencies. (Hint: An instruction with latency "+2" needs 2 <stall>
cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction
has latency 1 + 0, meaning zero extra wait states. So latency 1 + 1 implies 1
stall cycle; latency 1 + N has N extra stall cycles.)


2.3 [15] <2.6, 2.7> Consider a multiple-issue design. Suppose you have two execution
pipelines, each capable of beginning execution of one instruction per cycle,
and enough fetch/decode bandwidth in the front end so that it will not stall your 
execution. Assume results can be immediately forwarded from one execution unit 
to another, or to itself. Further assume that the only reason an execution pipeline 
would stall is to observe a true data dependence. Now how many cycles does the 
loop require? 


2.4 [10] <2.6, 2.7> In the multiple-issue design of Exercise 2.3, you may have recognized
some subtle issues. Even though the two pipelines have the exact same
instruction repertoire, they are neither identical nor interchangeable, because there is
an implicit ordering between them that must reflect the ordering of the instructions
in the original program. If instruction N + 1 begins execution in Execution
Pipe 1 at the same time that instruction N begins in Pipe 0, and N + 1 happens to
require a shorter execution latency than N, then N + 1 will complete before N
(even though program ordering would have implied otherwise). Recite at least
two reasons why that could be hazardous and will require special considerations
in the microarchitecture. Give an example of two instructions from the code in
Figure 2.35 that demonstrate this hazard.


2.5 [20] <2.7> Reorder the instructions to improve performance of the code in Figure
2.35. Assume the two-pipe machine in Exercise 2.3, and that the out-of-order 
completion issues of Exercise 2.4 have been dealt with successfully. Just worry 
about observing true data dependences and functional unit latencies for now. 
How many cycles does your reordered code take? 


2.6 [10/10] <2.1, 2.2> Every cycle that does not initiate a new operation in a pipe is a
lost opportunity, in the sense that your hardware is not "living up to its potential." 


a. [10] <2.1, 2.2> In your reordered code from Exercise 2.5, what fraction of all 
cycles, counting both pipes, were wasted (did not initiate a new op)? 


b. [10] <2.1, 2.2> Loop unrolling is one standard compiler technique for finding 
more parallelism in code, in order to minimize the lost opportunities for per- 
formance. 
c. Hand-unroll two iterations of the loop in your reordered code from Exercise
2.5. What speedup did you obtain? (For this exercise, just color the N + 1 iteration's
instructions green to distinguish them from the Nth iteration's; if you
were actually unrolling the loop you would have to reassign registers to prevent
collisions between the iterations.)


2.7 [15] <2.1> Computers spend most of their time in loops, so multiple loop iterations
are great places to speculatively find more work to keep CPU resources
busy. Nothing is ever easy, though; the compiler emitted only one copy of that
loop's code, so even though multiple iterations are handling distinct data, they
will appear to use the same registers. To keep the register usages of multiple iterations
from colliding, we rename their registers. Figure 2.36 shows example code that
we would like our hardware to rename.


A compiler could have simply unrolled the loop and used different registers to 
avoid conflicts, but if we expect our hardware to unroll the loop, it must also do 
the register renaming. How? Assume your hardware has a pool of temporary registers
(call them T registers, and assume there are 64 of them, T0 through T63)
that it can substitute for those registers designated by the compiler. This rename 
hardware is indexed by the source register designation, and the value in the table 
is the T register of the last destination that targeted that register. (Think of these 
table values as producers, and the src registers are the consumers; it doesn't much 
matter where the producer puts its result as long as its consumers can find it.) 
Consider the code sequence in Figure 2.36. Every time you see a destination reg- 
ister in the code, substitute the next available T, beginning with T9. Then update 
all the src registers accordingly, so that true data dependences are maintained. 
Show the resulting code. (Hint: See Figure 2.37.) 
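
The renaming rule described above can be sketched as follows. This is a minimal illustration, not a description of any real renamer: the function name rename and the tuple layout are hypothetical, and registers with no earlier producer in the sequence are simply left under their architectural names. The two input instructions are the first two of Figure 2.36, and the result matches the start of the expected output in Figure 2.37.

```python
def rename(program, first_free=9):
    """Rename destinations to T registers and patch later sources to match.

    program: list of (opcode, destination register or None, list of source registers).
    """
    table = {}               # architectural register -> T register of its latest producer
    next_free = first_free   # the exercise starts allocating at T9
    renamed = []
    for op, dest, sources in program:
        # Sources are looked up BEFORE the destination is updated, so an
        # instruction such as ADDD F5,F5,F2 reads the old mapping of F5.
        new_sources = [table.get(s, s) for s in sources]
        new_dest = dest
        if dest is not None:
            new_dest = f"T{next_free}"
            next_free += 1
            table[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


# First two instructions of Figure 2.36 (the address register Rx is listed as a source).
program = [
    ("LD", "F2", ["Rx"]),
    ("MULTD", "F5", ["F0", "F2"]),
]
renamed = rename(program)
# renamed == [("LD", "T9", ["Rx"]), ("MULTD", "T10", ["F0", "T9"])]
```

Looking up the sources before allocating the destination is the one ordering decision that matters here; reversing it would make an instruction consume its own result.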


Loop: LD     F2,0(Rx)
I0:   MULTD  F5,F0,F2
I1:   DIVD   F8,F0,F2
I2:   LD     F4,0(Ry)
I3:   ADDD   F6,F0,F4
I4:   ADDD   F10,F8,F2
I5:   SD     F4,0(Ry)


Figure 2.36 Sample code for register renaming practice. 


I0: LD    T9,0(Rx)
I1: MULTD T10,F0,T9


Figure 2.37 Expected output of register renaming. 
I0: MULTD F5,F0,F2
I1: ADDD  F9,F5,F4
I2: ADDD  F5,F5,F2
I3: DIVD  F2,F9,F0


Figure 2.38 Sample code for superscalar register renaming. 


2.8 [20] <2.4> Exercise 2.7 explored simple register renaming: when the hardware
register renamer sees a source register, it substitutes the destination T register of 
the last instruction to have targeted that source register. When the rename table 
sees a destination register, it substitutes the next available T for it. But superscalar 
designs need to handle multiple instructions per clock cycle at every stage in the 
machine, including the register renaming. A simple scalar processor would there- 
fore look up both src register mappings for each instruction, and allocate a new 
destination mapping per clock cycle. Superscalar processors must be able to do 
that as well, but they must also ensure that any dest-to-src relationships between 
the two concurrent instructions are handled correctly. Consider the sample code 
sequence in Figure 2.38. Assume that we would like to simultaneously rename 
the first two instructions. Further assume that the next two available T registers to 
be used are known at the beginning of the clock cycle in which these two instruc- 
tions are being renamed. Conceptually, what we want is for the first instruction to 
do its rename table lookups, and then update the table per its destination's T reg- 
ister. Then the second instruction would do exactly the same thing, and any inter- 
instruction dependency would thereby be handled correctly. But there's not 
enough time to write that T register designation into the renaming table and then 
look it up again for the second instruction, all in the same clock cycle. That regis- 
ter substitution must instead be done live (in parallel with the register rename 
table update). Figure 2.39 shows a circuit diagram, using multiplexers and com- 
parators, that will accomplish the necessary on-the-fly register renaming. Your 
task is to show the cycle-by-cycle state of the rename table for every instruction 
of the code. Assume the table starts out with every entry equal to its index (T0 = 0;
T1 = 1, ...).
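
To make the "live" substitution concrete, here is a small Python sketch of renaming two instructions in the same cycle. It is an illustration only: the function rename_pair is hypothetical, the starting mappings for F0, F2, and F4 and the next free register (T9) are the example values shown in Figure 2.39 rather than the all-identity table the exercise asks for, and the starting mapping for F5 (T28) is an assumed stale value chosen to show what the comparator prevents.

```python
def rename_pair(table, first, second, next_free):
    """Rename two instructions in one cycle with an on-the-fly bypass.

    table maps architectural registers to T registers at the start of the cycle.
    Each instruction is (opcode, destination, list of sources).
    """
    op0, dst0, srcs0 = first
    op1, dst1, srcs1 = second
    # Both instructions read the table as it stood at the start of the cycle.
    new_srcs0 = [table[s] for s in srcs0]
    looked_up1 = [table[s] for s in srcs1]
    t0, t1 = f"T{next_free}", f"T{next_free + 1}"
    # Comparator plus mux: if the second instruction reads what the first one
    # writes, select the first instruction's new T register, not the stale entry.
    new_srcs1 = [t0 if s == dst0 else stale for s, stale in zip(srcs1, looked_up1)]
    # The table is written at the end of the cycle; on a shared destination the
    # younger instruction's mapping wins because it is written last.
    new_table = dict(table)
    new_table[dst0] = t0
    new_table[dst1] = t1
    return (op0, t0, new_srcs0), (op1, t1, new_srcs1), new_table, next_free + 2


# Example mappings as in Figure 2.39; the F5 entry is an assumed stale value.
table = {"F0": "T6", "F2": "T39", "F4": "T14", "F5": "T28"}
i0 = ("MULTD", "F5", ["F0", "F2"])
i1 = ("ADDD", "F9", ["F5", "F4"])
r0, r1, new_table, next_free = rename_pair(table, i0, i1, next_free=9)
# r0 == ("MULTD", "T9", ["T6", "T39"])
# r1 == ("ADDD", "T10", ["T9", "T14"])   # F5 bypassed to T9, not the stale T28
```

Without the comparison against the first instruction's destination, the second instruction would pick up T28 from the table and read a value that is no longer the most recent F5.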


2.9 [5] <2.4> If you ever get confused about what a register renamer has to do, go
back to the assembly code you're executing, and ask yourself what has to happen 
for the right result to be obtained. For example, consider a three-way superscalar 
machine renaming these three instructions concurrently: 


ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1


If the value of R1 starts out as 5, what should its value be when this sequence has
executed?
[Figure 2.39: circuit diagram. The rename table (entries 0 through 63) and the next
available T register (9, which appears in the rename table in the next clock cycle)
feed multiplexers and comparators. Instruction I0 (dst = F5, src1 = F0, src2 = F2)
is renamed to dst = T9, src1 = T6, src2 = T39. Instruction I1 (dst = F9, src1 = F5,
src2 = F4) is renamed to dst = T10, src1 = T9, src2 = T14; a comparator checks the
I0 destination against each I1 source, with a similar mux for src2.]

Figure 2.39 Rename table and on-the-fly register substitution logic for superscalar 
machines. (Note:"src" is source,"dst" is destination.) 


Loop: LW   R1,0(R2)   ;  LW   R3,8(R2)
      <stall>
      <stall>
      ADDI R10,R1,#1  ;  ADDI R11,R3,#1
      SW   R1,0(R2)   ;  SW   R3,8(R2)
      ADDI R2,R2,#8
      SUB  R4,R3,R2
      BNZ  R4,Loop


Figure 2.40 Sample VLIW code with two adds, two loads, and two stalls. 


2.10 [20] <2.4, 2.9> VLIW designers have a few basic choices to make regarding
architectural rules for register use. Suppose a VLIW is designed with self-drain- 
ing execution pipelines: once an operation is initiated, its results will appear in 
the destination register at most L cycles later (where L is the latency of the opera- 
tion). There are never enough registers, so there is a temptation to wring maxi- 
mum use out of the registers that exist. Consider Figure 2.40. If loads have a 1 + 
2 cycle latency, unroll this loop once, and show how a VLIW capable of two 
loads and two adds per cycle can use the minimum number of registers, in the 
absence of any pipeline interruptions or stalls. Give an example of an event that, 
in the presence of self-draining pipelines, could disrupt this pipelining and yield 
wrong results. 
2.11 [10/10/10] <2.3> Assume a five-stage single-pipeline microarchitecture (fetch,
decode, execute, memory, write back) and the code in Figure 2.41. All ops are 1
cycle except LW and SW, which are 1 + 2 cycles, and branches, which are 1 + 1
cycles. There is no forwarding. Show the phases of each instruction per clock
cycle for one iteration of the loop.


a. [10] <2.3> How many clock cycles per loop iteration are lost to branch over- 
head? 


b. [10] <2.3> Assume a static branch predictor, capable of recognizing a back- 
wards branch in the decode stage. Now how many clock cycles are wasted on 
branch overhead? 


c. [10] <2.3> Assume a dynamic branch predictor. How many cycles are lost on 
a correct prediction? 


2.12 [20/20/20/10/20] <2.4, 2.7, 2.10> Let's consider what dynamic scheduling might
achieve here. Assume a microarchitecture as shown in Figure 2.42. Assume that
the ALUs can do all arithmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches,
and that the Reservation Station (RS) can dispatch at most one operation to each
functional unit per cycle (one op to each ALU plus one memory op to the LD/ST
unit).





Loop: LW   R1,0(R2)
      ADDI R1,R1,#1


      ADDI R2,R2,#4
      SUB  R4,R3,R2
      BNZ  R4,Loop








Figure 2.41 Code loop for Exercise 2.11. 


[Figure 2.42: block diagram. Instructions from the decoder enter the reservation
station, which dispatches operations to the ALUs and to the LD/ST unit; the LD/ST
unit connects to memory.]

Figure 2.42 An out-of-order microarchitecture. 
a. [15] <2.4> Suppose all of the instructions from the sequence in Figure 2.35
are present in the RS, with no renaming having been done. Highlight any
instructions in the code where register renaming would improve performance.
Hint: Look for RAW and WAW hazards. Assume the same functional unit
latencies as in Figure 2.35.


b. [20] <2.4> Suppose the register-renamed version of the code from part (a) is
resident in the RS in clock cycle N, with latencies as given in Figure 2.35. 
Show how the RS should dispatch these instructions out-of-order, clock by 
clock, to obtain optimal performance on this code. (Assume the same RS 
restrictions as in part (a). Also assume that results must be written into the RS 
before they're available for use; i.e., no bypassing.) How many clock cycles 
does the code sequence take? 


c. [20] <2.4> Part (b) lets the RS try to optimally schedule these instructions.
But in reality, the whole instruction sequence of interest is not usually present 
in the RS. Instead, various events clear the RS, and as a new code sequence 
streams in from the decoder, the RS must choose to dispatch what it has. Sup- 
pose that the RS is empty. In cycle 0 the first two register-renamed instruc- 
tions of this sequence appear in the RS. Assume it takes 1 clock cycle to 
dispatch any op, and assume functional unit latencies are as they were for 
Exercise 2.2. Further assume that the front end (decoder/register-renamer) 
will continue to supply two new instructions per clock cycle. Show the cycle- 
by-cycle order of dispatch of the RS. How many clock cycles does this code 
sequence require now? 


d. [10] <2.10> If you wanted to improve the results of part (c), which would
have helped most: (1) another ALU; (2) another LD/ST unit; (3) full bypass- 
ing of ALU results to subsequent operations; (4) cutting the longest latency in 
half? What's the speedup? 


e. [20] <2.7> Now let's consider speculation, the act of fetching, decoding, and
executing beyond one or more conditional branches. Our motivation to do
this is twofold: the dispatch schedule we came up with in part (c) had lots of
nops, and we know computers spend most of their time executing loops
(which implies the branch back to the top of the loop is pretty predictable).
Loops tell us where to find more work to do; our sparse dispatch schedule
suggests we have opportunities to do some of that work earlier than before. In
part (d) you found the critical path through the loop. Imagine folding a second
copy of that path onto the schedule you got in part (b). How many more
clock cycles would be required to do two loops' worth of work (assuming all
instructions are resident in the RS)? (Assume all functional units are fully
pipelined.)
Case Study 2: Modeling a Branch Predictor


Concept illustrated by this case study

■ Modeling a Branch Predictor


Besides studying microarchitecture techniques, to really understand computer 
architecture you must also program computers. Getting your hands dirty by 
directly modeling various microarchitectural ideas is better yet. Write a C or Java 
program to model a 2,1 branch predictor. Your program will read a series of lines 
from a file named history.txt (available on the companion CD; see Figure
2.43).

Each line of that file has three data items, separated by tabs. The first datum 
on each line is the address of the branch instruction in hex. The second datum is 
the branch target address in hex. The third datum is a 1 or a 0; 1 indicates a taken 
branch, and 0 indicates not taken. The total number of branches your model will 
consider is, of course, equal to the number of lines in the file. Assume a direct- 
mapped BTB, and don't worry about instruction lengths or alignment (i.e., if 
your BTB has four entries, then branch instructions at 0x0, 0x1, 0x2, and 0x3 
will reside in those four entries, but a branch instruction at 0x4 will overwrite 
BTB[0]). For each line in the input file, your model will read the pair of data val- 
ues, adjust the various tables per the branch predictor being modeled, and collect 
key performance statistics. The final output of your program will look like that 
shown in Figure 2.44. 
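
As a starting point, the table-update loop might be sketched as below. This is a hedged Python sketch rather than the C or Java program the case study asks for, and it models only a direct-mapped BTB whose entries carry a 2-bit (four-state) saturating counter, not a correlating predictor. Several choices are assumptions made for illustration: the function name simulate_btb, predicting not taken on a BTB miss, initializing a new entry to the weakly-not-taken state, and allocating an entry on every miss. The four-line trace is made up and is not from history.txt.

```python
def simulate_btb(lines, entries=64):
    """Run a branch trace through a direct-mapped BTB with 2-bit counters."""
    btb = [None] * entries            # each slot holds [branch_pc, target, counter]
    hits = mispredicts = total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                  # skip blank or malformed lines
        pc, target = int(fields[0], 16), int(fields[1], 16)
        taken = fields[2] == "1"
        total += 1
        slot = pc % entries           # direct-mapped, ignoring alignment
        entry = btb[slot]
        if entry is not None and entry[0] == pc:
            hits += 1
            predicted_taken = entry[2] >= 2
        else:
            predicted_taken = False               # assumption: a miss predicts not taken
            entry = btb[slot] = [pc, target, 1]   # assumption: start weakly not taken
        if predicted_taken != taken:
            mispredicts += 1
        # Saturating 2-bit counter: 0 and 1 predict not taken, 2 and 3 predict taken.
        entry[2] = min(3, entry[2] + 1) if taken else max(0, entry[2] - 1)
        entry[1] = target
    return {"total": total, "hits": hits, "mispredicts": mispredicts}


# Made-up trace: 0x10 and 0x50 collide in a 64-entry BTB but not in a 128-entry one.
trace = ["0x10\t0x20\t1", "0x10\t0x20\t1", "0x50\t0x60\t0", "0x10\t0x20\t1"]
small = simulate_btb(trace, entries=64)    # {"total": 4, "hits": 1, "mispredicts": 2}
large = simulate_btb(trace, entries=128)   # {"total": 4, "hits": 2, "mispredicts": 1}
```

Passing the number of entries as a parameter makes it straightforward to wire up the command-line option and to rerun the same trace for a warm start.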


Make the number of BTB entries in your model a command-line option. 
2.13 [20/10/10/10/10/10/10] <2.3> Write a model of a simple four-state branch target
buffer with 64 entries. 


a. [20] <2.3> What is the overall hit rate in the BTB (the fraction of times a 
branch was looked up in the BTB and found present)? 


0x40074cdb    0x40074cdf
0x40074ce2    0x40078d12    0
0x4009a247    0x4009a2bb    0
0x4009a259    0x4009a2c8    0
0x4009a267    0x4009a2ac    1
0x4009a2b4    0x4009a2ac    1

Address of    Branch target    1: taken
branch        address          0: not taken
instruction


Figure 2.43 Sample history.txt input file format. 
b. [10] <2.3> What is the overall branch misprediction rate on a cold start (the
fraction of times a branch was incorrectly predicted taken or not taken, regardless
of whether that prediction "belonged to" the branch being predicted)?


c. [10] <2.3> Find the most common branch. What was its contribution to the
overall number of correct predictions? (Hint: Count the number of times that 
branch occurs in the history.txt file, then track how each instance of that 
branch fares within the BTB model.) 


d. [10] <2.3> How many capacity misses did your branch predictor suffer?


e. [10] <2.3> What is the effect of a cold start versus a warm start? To find out,
run the same input data set once to initialize the history table, and then again 
to collect the new set of statistics. 

f. [10] <2.3> Cold-start the BTB 4 more times, with BTB sizes 16, 32, and 64.
Graph the resulting five misprediction rates. Also graph the five hit rates. 


g. [10] Submit the well-written, commented source code for your branch target
buffer model. 
Exercise 2.13 (a)
Number of hits in BTB: 54390. Total brs: 55493. Hit rate: 99.8%


Exercise 2.13(b) 
Incorrect predictions: 1562 of 55493, or 2.8% 


Exercise 2.13 (c) 
<a simple unix command line shell script will give you the most 
common branch...show how you got it here.> 
Most signif. branch seen 15418 times, out of 55493 tot brs ; 
27.8% 
MS branch = 0x80484ef, correct predictions = 19151 (of 36342 
total correct preds) or 52.7% 


Exercise 2.13 (d) 
Total unique branches (1 miss per br compulsory): 121 
Total misses seen: 104. 
So total capacity misses = total misses - compulsory misses = 17 
Exercise 2.13 (e) 
Number of hits in BTB: 54390. Total brs: 55493. Hit rate: 99.8% 
Incorrect predictions: 1103 of 54493, or 2.0% 


Exercise 2.13 (f) 


BTB Length    mispredict rate


1 32.91% 
2 6.42% 
4 0.28% 
8 0.23% 

16 0.21% 
32 0.20% 
64 0.20% 


Figure 2.44 Sample program output format. 
Limits on Instruction-Level Parallelism


Processors are being produced with the potential for very many 
parallel operations on the instruction level.... Far greater extremes in 
instruction-level parallelism are on the horizon. 


J. Fisher
(1981), in the paper that inaugurated
the term "instruction-level parallelism"


One of the surprises about IA-64 is that we hear no claims of high
frequency, despite claims that an EPIC processor is less complex than
a superscalar processor. It's hard to know why this is so, but one can
speculate that the overall complexity involved in focusing on CPI, as
IA-64 does, makes it hard to get high megahertz.


M. Hopkins
(2000), in a commentary on the IA-64 architecture,
a joint development of HP and Intel designed to achieve dramatic
increases in the exploitation
of ILP while retaining a simple architecture,
which would allow higher performance
3.1 Introduction


As we indicated in the last chapter, exploiting ILP was the primary focus of pro-
cessor designs for about 20 years starting in the mid-1980s. For the first 15 years,
we saw a progression of successively more sophisticated schemes for pipelining,
multiple issue, dynamic scheduling and speculation. Since 2000, designers have
focused primarily on optimizing designs or trying to achieve higher clock rates
without increasing issue rates. As we indicated in the close of the last chapter,
this era of advances in exploiting ILP appears to be coming to an end.

In this chapter we begin by examining the limitations on ILP from program
structure, from realistic assumptions about hardware budgets, and from the accu-
racy of important techniques for speculation such as branch prediction. In Sec-
tion 3.5, we examine the use of thread-level parallelism as an alternative or
addition to instruction-level parallelism. Finally, we conclude the chapter by
comparing a set of recent processors both in performance and in efficiency mea-
sures per transistor and per watt.


3.2 Studies of the Limitations of ILP


Exploiting ILP to increase performance began with the first pipelined processors 
in the 1960s. In the 1980s and 1990s, these techniques were key to achieving 
rapid performance improvements. The question of how much ILP exists was 
critical to our long-term ability to enhance performance at a rate that exceeds the 
increase in speed of the base integrated circuit technology. On a shorter scale, the 
critical question of what is needed to exploit more ILP is crucial to both com- 
puter designers and compiler writers. The data in this section also provide us with 
a way to examine the value of ideas that we have introduced in the last chapter, 
including memory disambiguation, register renaming, and speculation. 

In this section we review one of the studies done of these questions. The his- 
torical perspectives section in Appendix K describes several studies, including 
the source for the data in this section (Wall's 1993 study). All these studies of 
available parallelism operate by making a set of assumptions and seeing how 
much parallelism is available under those assumptions. The data we examine 
here are from a study that makes the fewest assumptions; in fact, the ultimate 
hardware model is probably unrealizable. Nonetheless, all such studies assume a 
certain level of compiler technology, and some of these assumptions could affect 
the results, despite the use of incredibly ambitious hardware. 

In the future, advances in compiler technology together with significantly 
new and different hardware techniques may be able to overcome some limitations 
assumed in these studies; however, it is unlikely that such advances when coupled 
with realistic hardware will overcome these limits in the near future. For exam- 
ple, value prediction, which we examined in the last chapter, can remove data 
dependence limits. For value prediction to have a significant impact on perfor- 
mance, however, predictors would need to achieve far higher prediction accuracy 
than has so far been reported. Indeed, for reasons we discuss in Section 3.6, we
are likely reaching the limits of how much ILP can be exploited efficiently. This 
section will lay the groundwork to understand why this is the case. 


The Hardware Model 


To see what the limits of ILP might be, we first need to define an ideal processor. 
An ideal processor is one where all constraints on ILP are removed. The only 
limits on ILP in such a processor are those imposed by the actual data flows 
through either registers or memory. 

The assumptions made for an ideal or perfect processor are as follows: 


1. Register renaming—There are an infinite number of virtual registers avail- 
able, and hence all WAW and WAR hazards are avoided and an unbounded 
number of instructions can begin execution simultaneously. 


2. Branch prediction—Branch prediction is perfect. All conditional branches 
are predicted exactly. 


3. Jump prediction—All jumps (including jump register used for return and 
computed jumps) are perfectly predicted. When combined with perfect 
branch prediction, this is equivalent to having a processor with perfect specu- 
lation and an unbounded buffer of instructions available for execution. 


4. Memory address alias analysis—All memory addresses are known exactly, 
and a load can be moved before a store provided that the addresses are not 
identical. Note that this implements perfect address alias analysis. 


5. Perfect caches—All memory accesses take 1 clock cycle. In practice, super- 
scalar processors will typically consume large amounts of ILP hiding cache 
misses, making these results highly optimistic. 


Assumptions 2 and 3 eliminate all control dependences. Likewise, assump- 
tions 1 and 4 eliminate all but the true data dependences. Together, these four 
assumptions mean that any instruction in the program's execution can be sched- 
uled on the cycle immediately following the execution of the predecessor on 
which it depends. It is even possible, under these assumptions, for the last 
dynamically executed instruction in the program to be scheduled on the very first 
cycle! Thus, this set of assumptions subsumes both control and address specula- 
tion and implements them as if they were perfect. 

Initially, we examine a processor that can issue an unlimited number of 
instructions at once looking arbitrarily far ahead in the computation. For all the 
processor models we examine, there are no restrictions on what types of instruc- 
tions can execute in a cycle. For the unlimited-issue case, this means there may 
be an unlimited number of loads or stores issuing in 1 clock cycle. In addition, all
functional unit latencies are assumed to be 1 cycle, so that any sequence of 
dependent instructions can issue on successive cycles. Latencies longer than 1 
cycle would decrease the number of issues per cycle, although not the number of 
instructions under execution at any point. (The instructions in execution at any
point are often referred to as in flight.)

Of course, this processor is on the edge of unrealizable. For example, the 
IBM Power5 is one of the most advanced superscalar processors announced to 
date. The Power5 issues up to four instructions per clock and initiates execution
on up to six (with significant restrictions on the instruction type, e.g., at most two 
load-stores), supports a large set of renaming registers (88 integer and 88 floating 
point, allowing over 200 instructions in flight, of which up to 32 can be loads and 
32 can be stores), uses a large aggressive branch predictor, and employs dynamic 
memory disambiguation. After looking at the parallelism available for the perfect 
processor, we will examine the impact of restricting various features. 

To measure the available parallelism, a set of programs was compiled and 
optimized with the standard MIPS optimizing compilers. The programs were 
instrumented and executed to produce a trace of the instruction and data refer- 
ences. Every instruction in the trace is then scheduled as early as possible, limited 
only by the data dependences. Since a trace is used, perfect branch prediction and 
perfect alias analysis are easy to do. With these mechanisms, instructions may be 
scheduled much earlier than they would otherwise, moving across large numbers 
of instructions on which they are not data dependent, including branches, since 
branches are perfectly predicted. 
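
The scheduling rule used in this kind of limit study can be sketched in Python as follows. This is a simplified illustration, not the tool used in the study: it assumes unit latency, unlimited issue width, and perfect renaming, so only true data dependences constrain an instruction, and it treats memory locations like any other named value. The function name dataflow_ilp and the six-entry trace are hypothetical.

```python
def dataflow_ilp(trace):
    """Average instructions per cycle when each instruction issues as early as possible.

    trace: list of (destination or None, list of sources) in program order.
    """
    produced_at = {}   # register or address -> cycle in which its latest value is produced
    depth = 0          # length of the longest dependence chain seen so far
    for dest, sources in trace:
        # Issue in the cycle right after the latest producer of any source.
        cycle = 1 + max((produced_at.get(s, 0) for s in sources), default=0)
        if dest is not None:
            # Overwriting the entry models perfect renaming: WAW and WAR never delay issue.
            produced_at[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if depth else 0.0


# Hypothetical trace: r0 is live on entry; the second write to r1 is a WAW hazard
# that renaming removes, so it issues in cycle 1.
trace = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),
    ("r3", ["r0"]),
    ("r4", ["r3", "r2"]),
    ("r1", ["r0"]),
    ("r5", ["r1"]),
]
ilp = dataflow_ilp(trace)   # 6 instructions over a 3-cycle critical path: 2.0
```

Because the whole trace is known in advance, branches never appear as constraints here, which is how a trace-driven study gets perfect branch prediction for free.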

Figure 3.1 shows the average amount of parallelism available for six of the 
SPEC92 benchmarks. Throughout this section the parallelism is measured by the 
average instruction issue rate. Remember that all instructions have a 1-cycle 
latency; a longer latency would reduce the average number of instructions per 
clock. Three of these benchmarks (fpppp, doduc, and tomcatv) are floating-point 
intensive, and the other three are integer programs. Two of the floating-point 
benchmarks (fpppp and tomcatv) have extensive parallelism, which could be 
exploited by a vector computer or by a multiprocessor (the structure in fpppp is 
quite messy, however, since some hand transformations have been done on the 
code). The doduc program has extensive parallelism, but the parallelism does not 
occur in simple parallel loops as it does in fpppp and tomcatv. The program li is a 
LISP interpreter that has many short dependences. 

In the next few sections, we restrict various aspects of this processor to show 
what the effects of various assumptions are before looking at some ambitious but 
realizable processors. 


Limitations on the Window Size and Maximum Issue Count 


To build a processor that even comes close to perfect branch prediction and per- 
fect alias analysis requires extensive dynamic analysis, since static compile time 
schemes cannot be perfect. Of course, most realistic dynamic schemes will not be 
perfect, but the use of dynamic schemes will provide the ability to uncover paral- 
lelism that cannot be analyzed by static compile time analysis. Thus, a dynamic 
processor might be able to more closely match the amount of parallelism uncov- 
ered by our ideal processor. 
Figure 3.1 ILP available in a perfect processor for six of the SPEC92 benchmarks. The
first three programs are integer programs, and the last three are floating-point 
programs. The floating-point programs are loop-intensive and have large amounts of 
loop-level parallelism. 


How close could a real dynamically scheduled, speculative processor come to 
the ideal processor? To gain insight into this question, consider what the perfect 
processor must do: 


1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all 
branches perfectly. 


2. Rename all register uses to avoid WAR and WAW hazards. 


3. Determine whether there are any data dependences among the instructions in
the issue packet; if so, rename accordingly.


4. Determine if any memory dependences exist among the issuing instructions 
and handle them appropriately. 


5. Provide enough replicated functional units to allow all the ready instructions 
to issue. 


Obviously, this analysis is quite complicated. For example, to determine 
whether n issuing instructions have any register dependences among them, 
assuming all instructions are register-register and the total number of registers is 
unbounded, requires 


$$2n - 2 + 2n - 4 + \cdots + 2 = 2\sum_{i=1}^{n-1} i = 2\,\frac{(n-1)n}{2} = n^2 - n$$


comparisons. Thus, to detect dependences among the next 2000 instructions—the 
default size we assume in several figures—requires almost 4 million comparisons! 
Even issuing only 50 instructions requires 2450 comparisons. This cost obviously 
limits the number of instructions that can be considered for issue at once. 
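
The figures quoted above follow directly from the closed form n^2 - n. As a quick check, the short Python sketch below (the function name is hypothetical) sums the series term by term and compares it with the closed form.

```python
def dependence_comparisons(n):
    """Comparisons needed to check n issuing instructions against one another."""
    term_by_term = sum(2 * i for i in range(1, n))   # 2(n-1) + 2(n-2) + ... + 2
    closed_form = n * n - n
    assert term_by_term == closed_form
    return closed_form


window_2000 = dependence_comparisons(2000)   # 3998000, i.e. almost 4 million
window_50 = dependence_comparisons(50)       # 2450
```

The quadratic growth is the point: a 40-fold larger window costs roughly 1600 times as many comparisons.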

In existing and near-term processors, the costs are not quite so high, since we 
need only detect dependence pairs and the limited number of registers allows dif- 
ferent solutions. Furthermore, in a real processor, issue occurs in order, and 
dependent instructions are handled by a renaming process that accommodates
dependent renaming in 1 clock. Once instructions are issued, the detection of 
dependences is handled in a distributed fashion by the reservation stations or 
scoreboard. 

The set of instructions that is examined for simultaneous execution is called 
the window. Each instruction in the window must be kept in the processor, and 
the number of comparisons required every clock is equal to the maximum com- 
pletion rate times the window size times the number of operands per instruction 
(today up to 6 x 200 x 2 = 2400), since every pending instruction must look at 
every completing instruction for either of its operands. Thus, the total window 
size is limited by the required storage, the comparisons, and a limited issue rate, 
which makes a larger window less helpful. Remember that even though existing 
processors allow hundreds of instructions to be in flight, because they cannot 
issue and rename more than a handful in any clock cycle, the maximum through- 
put is likely to be limited by the issue rate. For example, if the instruction stream 
contained totally independent instructions that all hit in the cache, a large window 
would simply never fill. The value of having a window larger than the issue rate 
occurs when there are dependences or cache misses in the instruction stream. 
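
As a rough check on the per-clock comparison cost, the product described above can be written out directly. This is only a sketch of the arithmetic; the function and parameter names are illustrative.

```python
def window_comparisons_per_clock(completion_rate: int,
                                 window_size: int,
                                 operands_per_instruction: int = 2) -> int:
    """Tag comparisons per clock: every pending instruction checks each of its
    operands against every instruction completing in that clock."""
    return completion_rate * window_size * operands_per_instruction


print(window_comparisons_per_clock(6, 200))    # 2400, the figure quoted in the text
print(window_comparisons_per_clock(64, 2048))  # 262144 for a 2K window completing 64 per clock
```

The second call uses the 2K-entry window and 64-issue limit assumed later in this analysis, and treats the completion rate as equal to the issue limit, which is an assumption of this sketch.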

The window size directly limits the number of instructions that begin exe- 
cution in a given cycle. In practice, real processors will have a more limited 
number of functional units (e.g., no superscalar processor has handled more 
than two memory references per clock), as well as limited numbers of buses 
and register access ports, which serve as limits on the number of instructions 
initiated per clock. Thus, the maximum number of instructions that may issue, 
begin execution, or commit in the same clock cycle is usually much smaller 
than the window size. 

Obviously, the number of possible implementation constraints in a multiple- 
issue processor is large, including issues per clock, functional units and unit 
latency, register file ports, functional unit queues (which may be fewer than 
units), issue limits for branches, and limitations on instruction commit. Each of 
these acts as a constraint on the ILP. Rather than try to understand each of these 
effects, however, we will focus on limiting the size of the window, with the 
understanding that all other restrictions would further reduce the amount of paral- 
lelism that can be exploited. 

Figure 3.2 shows the effects of restricting the size of the window from which 
an instruction can execute. As we can see in Figure 3.2, the amount of parallelism 
uncovered falls sharply with decreasing window size. In 2005, the most advanced 
processors have window sizes in the range of 64-200, but these window sizes are 
not strictly comparable to those shown in Figure 3.2 for two reasons. First, many 
functional units have multicycle latency, reducing the effective window size com- 
pared to the case where all units have single-cycle latency. Second, in real proces- 
sors the window must also hold any memory references waiting on a cache miss, 
which are not considered in this model, since it assumes a perfect, single-cycle 
cache access. 

Figure 3.2 The effect of window size shown by each application by plotting the 
average number of instruction issues per clock cycle. 


As we can see in Figure 3.2, the integer programs do not contain nearly as 
much parallelism as the floating-point programs. This result is to be expected. 
Looking at how the parallelism drops off in Figure 3.2 makes it clear that the par- 
allelism in the floating-point cases is coming from loop-level parallelism. The 
fact that the amount of parallelism at low window sizes is not that different 
among the floating-point and integer programs implies a structure where there are 
dependences within loop bodies, but few dependences between loop iterations in 
programs such as tomcatv. At small window sizes, the processors simply cannot 
see the instructions in the next loop iteration that could be issued in parallel with 
instructions from the current iteration. This case is an example of where better 
compiler technology (see Appendix G) could uncover higher amounts of ILP, 
since it could find the loop-level parallelism and schedule the code to take advan- 
tage of it, even with small window sizes. 

We know that very large window sizes are impractical and inefficient, and 
the data in Figure 3.2 tells us that instruction throughput will be considerably 
reduced with realistic implementations. Thus, we will assume a base window 
size of 2K entries, roughly 10 times as large as the largest implementation in 
2005, and a maximum issue capability of 64 instructions per clock, also 10 
times the widest issue processor in 2005, for the rest of this analysis. As we will 
see in the next few sections, when the rest of the processor is not perfect, a 2K 
window and a 64-issue limitation do not constrain the amount of ILP the proces- 
sor can exploit. 


The Effects of Realistic Branch and Jump Prediction 


Our ideal processor assumes that branches can be perfectly predicted: The out- 
come of any branch in the program is known before the first instruction is exe- 
cuted! Of course, no real processor can ever achieve this. Figure 3.3 shows the 
effects of more realistic prediction schemes in two different formats. Our data are 
for several different branch-prediction schemes, varying from perfect to no pre- 
dictor. We assume a separate predictor is used for jumps. Jump predictors are 
important primarily with the most accurate branch predictors, since the branch 
frequency is higher and the accuracy of the branch predictors dominates. 









[Figure 3.3 plot: instruction issues per cycle for the benchmarks gcc, espresso, fpppp, doduc, and tomcatv under five branch predictors: perfect, tournament predictor, standard 2-bit, profile-based, and none.] 


Figure 3.3 The effect of branch-prediction schemes sorted by application. This 
graph shows the impact of going from a perfect model of branch prediction (all 
branches predicted correctly arbitrarily far ahead); to various dynamic predictors (selec- 
tive and 2-bit); to compile time, profile-based prediction; and finally to using no predic- 
tor. The predictors are described precisely in the text. This graph highlights the 
differences among the programs with extensive loop-level parallelism (tomcatv and 
fpppp) and those without (the integer programs and doduc). 

The five levels of branch prediction shown in this figure are 


1. Perfect—All branches and jumps are perfectly predicted at the start of execu- 
tion. 


2. Tournament-based branch predictor—The prediction scheme uses a correlat- 
ing 2-bit predictor and a noncorrelating 2-bit predictor together with a selec- 
tor, which chooses the best predictor for each branch. The prediction buffer 
contains 2^13 (8K) entries, each consisting of three 2-bit fields, two of which 
are predictors and the third a selector. The correlating predictor is indexed 
using the exclusive-or of the branch address and the global branch history. 
The noncorrelating predictor is the standard 2-bit predictor indexed by the 
branch address. The selector table is also indexed by the branch address and 
specifies whether the correlating or noncorrelating predictor should be used. 
The selector is incremented or decremented just as we would for a standard 2- 
bit predictor. This predictor, which uses a total of 48K bits, achieves an aver- 
age misprediction rate of 3% for these six SPEC92 benchmarks and is com- 
parable in strategy and size to the best predictors in use in 2005. Jump 
prediction is done with a pair of 2K-entry predictors, one organized as a cir- 
cular buffer for predicting returns and one organized as a standard predictor 
and used for computed jumps (as in case statements or computed gotos). 
These jump predictors are nearly perfect. 


3. Standard 2-bit predictor with 512 2-bit entries—In addition, we assume a 16- 
entry buffer to predict returns. 


4. Profile-based—A static predictor uses the profile history of the program and 
predicts that the branch is always taken or always not taken based on the 
profile. 


5. None—No branch prediction is used, though jumps are still predicted. Paral- 
lelism is largely limited to within a basic block. 
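
The tournament scheme in item 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator used in the study: counters start in the weakly not-taken state, the selector moves only when the two component predictors disagree, the global history is as wide as the table index, and the jump and return predictors are not modeled. All class and function names are illustrative.

```python
def bump(counter: int, up: bool) -> int:
    """Saturating 2-bit counter update (range 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, index_bits: int = 13):
        size = 1 << index_bits                 # 2^13 = 8K entries by default
        self.mask = size - 1
        self.correlating = [1] * size          # indexed by PC XOR global history
        self.noncorrelating = [1] * size       # indexed by PC
        self.selector = [1] * size             # >= 2 means "use the correlating predictor"
        self.history = 0                       # global branch history (assumed index_bits wide)

    def _indices(self, pc: int):
        return (pc ^ self.history) & self.mask, pc & self.mask

    def predict(self, pc: int) -> bool:
        g, l = self._indices(pc)
        if self.selector[l] >= 2:
            return self.correlating[g] >= 2
        return self.noncorrelating[l] >= 2

    def update(self, pc: int, taken: bool) -> None:
        g, l = self._indices(pc)
        corr_ok = (self.correlating[g] >= 2) == taken
        non_ok = (self.noncorrelating[l] >= 2) == taken
        if corr_ok != non_ok:                  # assumption: train selector only on disagreement
            self.selector[l] = bump(self.selector[l], corr_ok)
        self.correlating[g] = bump(self.correlating[g], taken)
        self.noncorrelating[l] = bump(self.noncorrelating[l], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def storage_bits(self) -> int:
        return 3 * 2 * (self.mask + 1)         # three 2-bit fields per entry


def mispredictions(predictor: TournamentPredictor, pc: int, outcomes) -> int:
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


p = TournamentPredictor()
alternating = [i % 2 == 0 for i in range(20)]
warmup = mispredictions(p, 0x40, alternating)               # training phase
steady = mispredictions(p, 0x40, alternating * 5)           # pattern continues
print(p.storage_bits(), steady)
```

A strictly alternating branch defeats the plain 2-bit predictor, but once the global history has filled, the correlating table predicts it and the selector has switched over, so no further mispredictions occur in the steady-state run. The storage computed by `storage_bits()` is the 48K bits quoted above.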


Since we do not charge additional cycles for a mispredicted branch, the only 
effect of varying the branch prediction is to vary the amount of parallelism that 
can be exploited across basic blocks by speculation. Figure 3.4 shows the accu- 
racy of the three realistic predictors for the conditional branches for the subset of 
SPEC92 benchmarks we include here. 

Figure 3.3 shows that the branch behavior of two of the floating-point 
programs is much simpler than the other programs, primarily because these two 
programs have many fewer branches and the few branches that exist are more 
predictable. This property allows significant amounts of parallelism to be 
exploited with realistic prediction schemes. In contrast, for all the integer pro- 
grams and for doduc, the FP benchmark with the least loop-level parallelism, 
even the difference between perfect branch prediction and the ambitious selective 
predictor is dramatic. Like the window size data, these figures tell us that to 
achieve significant amounts of parallelism in integer programs, the processor 
must select and execute instructions that are widely separated. When branch 
prediction is not highly accurate, the mispredicted branches become a barrier to 
finding the parallelism. 

Figure 3.4 Branch misprediction rate for the conditional branches in the SPEC92 
subset. 

As we have seen, branch prediction is critical, especially with a window size 
of 2K instructions and an issue limit of 64. For the rest of this section, in addition 
to the window and issue limit, we assume as a base a more ambitious tournament 
predictor that uses two levels of prediction and a total of 8K entries. This predic- 
tor, which requires more than 150K bits of storage (roughly four times the largest 
predictor to date), slightly outperforms the selective predictor described above 
(by about 0.5-1%). We also assume a pair of 2K jump and return predictors, as 
described above. 


The Effects of Finite Registers 


Our ideal processor eliminates all name dependences among register references 
using an infinite set of virtual registers. To date, the IBM Power5 has provided 
the largest numbers of virtual registers: 88 additional floating-point and 88 addi- 
tional integer registers, in addition to the 64 registers available in the base archi- 
tecture. All 240 registers are shared by two threads when executing in 
multithreading mode (see Section 3.5), and all are available to a single thread 
when in single-thread mode. Figure 3.5 shows the effect of reducing the number 
of registers available for renaming; both the FP and GP registers are increased by 
the number of registers shown in the legend. 

Figure 3.5 The reduction in available parallelism is significant when fewer than an 
unbounded number of renaming registers are available. Both the number of FP regis- 
ters and the number of GP registers are increased by the number shown on the x-axis. 
So, the entry corresponding to "128 integer + 128 FP" has a total of 128 + 128 + 64 = 
320 registers (128 for integer renaming, 128 for FP renaming, and 64 integer and FP reg- 
isters present in the MIPS architecture). The effect is most dramatic on the FP programs, 
although having only 32 extra integer and 32 extra FP registers has a significant impact 
on all the programs. For the integer programs, the impact of having more than 64 extra 
registers is not seen here. To use more than 64 registers requires uncovering lots of par- 
allelism, which for the integer programs requires essentially perfect branch prediction. 


The results in this figure might seem somewhat surprising: You might expect 
that name dependences should only slightly reduce the parallelism available. 
Remember though, that exploiting large amounts of parallelism requires evaluat- 
ing many possible execution paths, speculatively. Thus, many registers are needed 
to hold live variables from these threads. Figure 3.5 shows that the impact of hav- 
ing only a finite number of registers is significant if extensive parallelism exists. 
Although this graph shows a large impact on the floating-point programs, the 
impact on the integer programs is small primarily because the limitations in win- 
dow size and branch prediction have limited the ILP substantially, making renam- 
ing less valuable. In addition, notice that the reduction in available parallelism is 
significant even if 64 additional integer and 64 additional FP registers are available 
for renaming, which is comparable to the number of extra registers available on 
any existing processor as of 2005. 

Although register renaming is obviously critical to performance, an infinite 
number of registers is not practical. Thus, for the next section, we assume that 
there are 256 integer and 256 FP registers available for renaming—far more than 
any anticipated processor has as of 2005. 


The Effects of Imperfect Alias Analysis 


Our optimal model assumes that it can perfectly analyze all memory depen- 
dences, as well as eliminate all register name dependences. Of course, perfect 
alias analysis is not possible in practice: The analysis cannot be perfect at com- 
pile time, and it requires a potentially unbounded number of comparisons at run 
time (since the number of simultaneous memory references is unconstrained). 
Figure 3.6 shows the impact of three other models of memory alias analysis, in 
addition to perfect analysis. The three models are 


1. Global/stack perfect—This model does perfect predictions for global and 
stack references and assumes all heap references conflict. This model repre- 
sents an idealized version of the best compiler-based analysis schemes cur- 
rently in production. Recent and ongoing research on alias analysis for 
pointers should improve the handling of pointers to the heap in the future. 


Figure 3.6 The effect of varying levels of alias analysis on individual programs. Any- 
thing less than perfect analysis has a dramatic impact on the amount of parallelism 
found in the integer programs, and global/stack analysis is perfect (and unrealizable) 
for the FORTRAN programs. 


2. Inspection—This model examines the accesses to see if they can be deter- 
mined not to interfere at compile time. For example, if an access uses R10 as 
a base register with an offset of 20, then another access that uses R10 as a 
base register with an offset of 100 cannot interfere, assuming R10 could not 
have changed. In addition, addresses based on registers that point to different 
allocation areas (such as the global area and the stack area) are assumed never 
to alias. This analysis is similar to that performed by many existing commer- 
cial compilers, though newer compilers can do better, at least for loop- 
oriented programs. 


3. None—All memory references are assumed to conflict. 
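
The inspection and none models lend themselves to a small sketch. It is only an illustration of the rules stated in the list: the `MemRef` record, the `area` tag, and the 4-byte default access size are assumptions introduced here, and the base register is assumed unchanged between the two accesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRef:
    base: str        # base register name, e.g. "R10"
    offset: int      # constant displacement in bytes
    area: str        # allocation area the base points into, e.g. "global" or "stack"
    size: int = 4    # bytes accessed (assumed word-sized)


def may_conflict(a: MemRef, b: MemRef, model: str) -> bool:
    """Return True if the two references must be treated as possibly aliasing."""
    if model == "none":
        return True                              # all references assumed to conflict
    if model != "inspection":
        raise ValueError(f"unknown model: {model}")
    if a.area != b.area:
        return False                             # different allocation areas never alias
    if a.base == b.base:
        # Same unchanged base register: compare the byte ranges of the two offsets.
        return a.offset < b.offset + b.size and b.offset < a.offset + a.size
    return True                                  # otherwise be conservative


x = MemRef("R10", 20, "stack")
y = MemRef("R10", 100, "stack")
print(may_conflict(x, y, "inspection"))  # False: same base, non-overlapping offsets
print(may_conflict(x, y, "none"))        # True
```

The global/stack perfect model is omitted because it needs the actual run-time addresses, which a compile-time sketch like this one does not have.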


As you might expect, for the FORTRAN programs (where no heap references 
exist), there is no difference between perfect and global/stack perfect analysis. 
The global/stack perfect analysis is optimistic, since no compiler could ever find 
all array dependences exactly. The fact that perfect analysis of global and stack 
references is still a factor of two better than inspection indicates that either 
sophisticated compiler analysis or dynamic analysis on the fly will be required to 
obtain much parallelism. In practice, dynamically scheduled processors rely on 
dynamic memory disambiguation. To implement perfect dynamic disambigua- 
tion for a given load, we must know the memory addresses of all earlier stores 
that have not yet committed, since a load may have a dependence through mem- 
ory on a store. As we mentioned in the last chapter, memory address speculation 
could be used to overcome this limit. 
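
The condition for perfect dynamic disambiguation can be stated as a small check. This is a sketch under simplifying assumptions: addresses are compared for exact equality (access sizes are ignored), no store-to-load forwarding is modeled, and no address speculation is attempted. The function name is illustrative.

```python
from typing import Optional, Sequence


def load_may_issue(load_addr: int,
                   earlier_store_addrs: Sequence[Optional[int]]) -> bool:
    """Decide whether a load can safely go ahead of earlier uncommitted stores.

    earlier_store_addrs holds the address of every earlier store that has not
    yet committed, or None where the address has not been computed yet.
    """
    for addr in earlier_store_addrs:
        if addr is None:
            return False   # unknown store address: a dependence cannot be ruled out
        if addr == load_addr:
            return False   # dependence through memory on this store
    return True


print(load_may_issue(0x100, [0x200, 0x300]))  # True
print(load_may_issue(0x100, [0x200, None]))   # False
```

Memory address speculation corresponds to returning True in the `None` case and recovering later if the guess turns out to be wrong.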


Limitations on ILP for Realizable Processors 


In this section we look at the performance of processors with ambitious levels of 
hardware support equal to or better than what is available in 2006 or likely to be 
available in the next few years. In particular we assume the following fixed 
attributes: 


1. Up to 64 instruction issues per clock with no issue restrictions, or roughly 
10 times the total issue width of the widest processor in 2005. As we dis- 
cuss later, the practical implications of very wide issue widths on clock 
rate, logic complexity, and power may be the most important limitation on 
exploiting ILP. 


2. A tournament predictor with 1K entries and a 16-entry return predictor. This 
predictor is fairly comparable to the best predictors in 2005; the predictor is 
not a primary bottleneck. 


3. Perfect disambiguation of memory references done dynamically—this is 
ambitious but perhaps attainable for small window sizes (and hence small issue 
rates and load-store buffers) or through a memory dependence predictor. 


4. Register renaming with 64 additional integer and 64 additional FP registers, 
which is roughly comparable to the IBM Power5. 


Figure 3.7 shows the result for this configuration as we vary the window size. 
This configuration is more complex and expensive than any existing implementa- 
tions, especially in terms of the number of instruction issues, which is more than 
10 times larger than the largest number of issues available on any processor in 
2005. Nonetheless, it gives a useful bound on what future implementations might 
yield. The data in these figures are likely to be very optimistic for another reason. 
There are no issue restrictions among the 64 instructions: They may all be mem- 
ory references. No one would even contemplate this capability in a processor in 
the near future. Unfortunately, it is quite difficult to bound the performance of a 
processor with reasonable issue restrictions; not only is the space of possibilities 
quite large, but the existence of issue restrictions requires that the parallelism be 
evaluated with an accurate instruction scheduler, making the cost of studying pro- 
cessors with large numbers of issues very expensive. 

Figure 3.7 The amount of parallelism available versus the window size for a variety 
of integer and floating-point programs with up to 64 arbitrary instruction issues per 
clock. Although there are fewer renaming registers than the window size, the fact that 
all operations have zero latency, and that the number of renaming registers equals the 
issue width, allows the processor to exploit parallelism within the entire window. In a 
real implementation, the window size and the number of renaming registers must be 
balanced to prevent one of these factors from overly constraining the issue rate. 

In addition, remember that in interpreting these results, cache misses and 
nonunit latencies have not been taken into account, and both these effects will 
have significant impact! 

The most startling observation from Figure 3.7 is that with the realistic pro- 
cessor constraints listed above, the effect of the window size for the integer pro- 
grams is not as severe as for FP programs. This result points to the key difference 
between these two types of programs. The availability of loop-level parallelism in 
two of the FP programs means that the amount of ILP that can be exploited is 
higher, but that for integer programs other factors—such as branch prediction, 
register renaming, and less parallelism to start with—are all important limita- 
tions. This observation is critical because of the increased emphasis on integer 
performance in the last few years. Indeed, most of the market growth in the last 
decade—transaction processing, web servers, and the like—depended on integer 
performance, rather than floating point. As we will see in the next section, for a 
realistic processor in 2005, the actual performance levels are much lower than 
those shown in Figure 3.7. 

Given the difficulty of increasing the instruction rates with realistic hardware 
designs, designers face a challenge in deciding how best to use the limited 
resources available on an integrated circuit. One of the most interesting trade-offs 
is between simpler processors with larger caches and higher clock rates versus 
more emphasis on instruction-level parallelism with a slower clock and smaller 
caches. The following example illustrates the challenges. 


Example Consider the following three hypothetical, but not atypical, processors, which we 
run with the SPEC gcc benchmark: 


1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and 
achieving a pipeline CPI of 0.8. This processor has a cache system that yields 
0.005 misses per instruction. 


2. A deeply pipelined version of a two-issue MIPS processor with slightly 
smaller caches and a 5 GHz clock rate. The pipeline CPI of the processor is 
1.0, and the smaller caches yield 0.0055 misses per instruction on average. 


3. A speculative superscalar with a 64-entry window. It achieves one-half of the 
ideal issue rate measured for this window size. (Use the data in Figure 3.7.) 
This processor has the smallest caches, which lead to 0.01 misses per instruc- 
tion, but it hides 25% of the miss penalty on every miss by dynamic schedul- 
ing. This processor has a 2.5 GHz clock. 


Assume that the main memory time (which sets the miss penalty) is 50 ns. Deter- 
mine the relative performance of these three processors. 


Answer First, we use the miss penalty and miss rate information to compute the contribu- 
tion to CPI from cache misses for each configuration. We do this with the follow- 
ing formula: 


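$$\text{Cache CPI} = \text{Misses per instruction} \times \text{Miss penalty}$$

We need to compute the miss penalties for each system: 

$$\text{Miss penalty} = \frac{\text{Memory access time}}{\text{Clock cycle}}$$

The clock cycle times for the processors are 250 ps, 200 ps, and 400 ps, respec- 
tively. Hence, the miss penalties are 

$$\text{Miss penalty}_1 = \frac{50\ \text{ns}}{250\ \text{ps}} = 200\ \text{cycles}$$
$$\text{Miss penalty}_2 = \frac{50\ \text{ns}}{200\ \text{ps}} = 250\ \text{cycles}$$
$$\text{Miss penalty}_3 = \frac{0.75 \times 50\ \text{ns}}{400\ \text{ps}} \approx 94\ \text{cycles}$$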


Applying this for each cache: 


$$\text{Cache CPI}_1 = 0.005 \times 200 = 1.0$$
$$\text{Cache CPI}_2 = 0.0055 \times 250 \approx 1.4$$
$$\text{Cache CPI}_3 = 0.01 \times 94 = 0.94$$


We know the pipeline CPI contribution for everything but processor 3; its pipe- 
line CPI is given by 


$$\text{Pipeline CPI}_3 = \frac{1}{\text{Issue rate}} = \frac{1}{9 \times 0.5} = \frac{1}{4.5} = 0.22$$


Now we can find the CPI for each processor by adding the pipeline and cache 
CPI contributions: 

$$\text{CPI}_1 = 0.8 + 1.0 = 1.8$$
$$\text{CPI}_2 = 1.0 + 1.4 = 2.4$$
$$\text{CPI}_3 = 0.22 + 0.94 = 1.16$$

Since this is the same architecture, we can compare instruction execution rates in 
millions of instructions per second (MIPS) to determine relative performance: 

$$\text{Instruction execution rate} = \frac{\text{CR}}{\text{CPI}}$$

$$\text{Instruction execution rate}_1 = \frac{4000\ \text{MHz}}{1.8} = 2222\ \text{MIPS}$$
$$\text{Instruction execution rate}_2 = \frac{5000\ \text{MHz}}{2.4} = 2083\ \text{MIPS}$$
$$\text{Instruction execution rate}_3 = \frac{2500\ \text{MHz}}{1.16} = 2155\ \text{MIPS}$$


In this example, the simple two-issue static superscalar looks best. In practice, 
performance depends on both the CPI and clock rate assumptions. 
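
The whole calculation can be reproduced with a short script, which makes it easy to vary the assumptions. This is a sketch: the function and parameter names are illustrative, the issue rate of 9 for processor 3 is the value the answer takes from Figure 3.7, and no intermediate rounding is applied.

```python
MEMORY_TIME_NS = 50.0  # main memory time that sets the miss penalty


def mips_rating(clock_ghz: float, pipeline_cpi: float,
                misses_per_instr: float, hidden_fraction: float = 0.0) -> float:
    """Instruction execution rate in MIPS = clock rate / (pipeline CPI + cache CPI)."""
    cycle_ns = 1.0 / clock_ghz
    miss_penalty_cycles = (1.0 - hidden_fraction) * MEMORY_TIME_NS / cycle_ns
    cpi = pipeline_cpi + misses_per_instr * miss_penalty_cycles
    return clock_ghz * 1000.0 / cpi


p1 = mips_rating(4.0, 0.8, 0.005)
p2 = mips_rating(5.0, 1.0, 0.0055)
p3 = mips_rating(2.5, 1.0 / (9 * 0.5), 0.01, hidden_fraction=0.25)

for name, rate in (("processor 1", p1), ("processor 2", p2), ("processor 3", p3)):
    print(f"{name}: {rate:.0f} MIPS")
```

Without intermediate rounding, processor 2 comes out near 2105 MIPS rather than 2083, because the worked answer rounds its cache CPI of 1.375 up to 1.4. The ranking is unchanged: processor 1, then processor 3, then processor 2.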


Beyond the Limits of This Study 


Like any limit study, the study we have examined in this section has its own limi- 
tations. We divide these into two classes: limitations that arise even for the per- 
fect speculative processor, and limitations that arise for one or more realistic 
models. Of course, all the limitations in the first class apply to the second. The 
most important limitations that apply even to the perfect model are 


1. WAW and WAR hazards through memory—The study eliminated WAW and 
WAR hazards through register renaming, but not in memory usage. Although 
at first glance it might appear that such circumstances are rare (especially 
WAW hazards), they arise due to the allocation of stack frames. A called pro- 
cedure reuses the memory locations of a previous procedure on the stack, and 
this can lead to WAW and WAR hazards that are unnecessarily limiting. Aus- 
tin and Sohi [1992] examine this issue. 


2. Unnecessary dependences—With infinite numbers of registers, all but true 
register data dependences are removed. There are, however, dependences 
arising from either recurrences or code generation conventions that introduce 
unnecessary true data dependences. One example of these is the dependence 
on the control variable in a simple do loop: Since the control variable is incre- 
mented on every loop iteration, the loop contains at least one dependence. As 
we show in Appendix G, loop unrolling and aggressive algebraic optimiza- 
tion can remove such dependent computation. Wall's study includes a limited 
amount of such optimizations, but applying them more aggressively could 
lead to increased amounts of ILP. In addition, certain code generation con- 
ventions introduce unneeded dependences, in particular the use of return 
address registers and a register for the stack pointer (which is incremented 
and decremented in the call/return sequence). Wall removes the effect of the 
return address register, but the use of a stack pointer in the linkage conven- 
tion can cause "unnecessary" dependences. Postiff et al. [1999] explored the 
advantages of removing this constraint. 


3. Overcoming the data flow limit—If value prediction worked with high accu- 
racy, it could overcome the data flow limit. As of yet, none of the more than 
50 papers on the subject have achieved a significant enhancement in ILP 
when using a realistic prediction scheme. Obviously, perfect data value pre- 
diction would lead to effectively infinite parallelism, since every value of 
every instruction could be predicted a priori. 


For a less-than-perfect processor, several ideas have been proposed that could 
expose more ILP. One example is to speculate along multiple paths. This idea 
was discussed by Lam and Wilson [1992] and explored in the study covered in 
this section. By speculating on multiple paths, the cost of incorrect recovery is 
reduced and more parallelism can be uncovered. It only makes sense to evaluate 
this scheme for a limited number of branches because the hardware resources 
required grow exponentially. Wall [1993] provides data for speculating in both 
directions on up to eight branches. Given the costs of pursuing both paths, know- 
ing that one will be thrown away (and the growing amount of useless computa- 
tion as such a process is followed through multiple branches), every commercial 
design has instead devoted additional hardware to better speculation on the cor- 
rect path. 
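
The exponential growth is easy to quantify. The sketch below assumes that both directions are followed at every unresolved branch, so the number of leaf paths doubles with each branch and exactly one of them is the correct path. The function names are illustrative.

```python
def paths_in_flight(unresolved_branches: int) -> int:
    """Leaf paths when both directions of every unresolved branch are followed."""
    return 2 ** unresolved_branches


def correct_path_fraction(unresolved_branches: int) -> float:
    """Fraction of those leaf paths that will turn out to be the correct one."""
    return 1.0 / paths_in_flight(unresolved_branches)


for k in (1, 2, 4, 8):
    print(k, paths_in_flight(k), correct_path_fraction(k))
```

At eight branches, the upper end of the data mentioned above, 256 leaf paths are in flight and 255 of them will be discarded.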

It is critical to understand that none of the limits in this section are fundamen- 
tal in the sense that overcoming them requires a change in the laws of physics! 
Instead, they are practical limitations that imply the existence of some formidable 
barriers to exploiting additional ILP. These limitations—whether they be window 
size, alias detection, or branch prediction—represent challenges for designers 
and researchers to overcome! As we discuss in Section 3.6, the implications of 
ILP limitations and the costs of implementing wider issue seem to have created 
effective limitations on ILP exploitation. 


Crosscutting Issues: Hardware versus Software Speculation 


"Crosscutting Issues" is a section that discusses topics that involve subjects from 
different chapters. The next few chapters include such a section. 

The hardware-intensive approaches to speculation in the previous chapter and 
the software approaches of Appendix G provide alternative approaches to 
exploiting ILP. Some of the trade-offs, and the limitations, for these approaches 
are listed below: 


• To speculate extensively, we must be able to disambiguate memory refer- 
ences. This capability is difficult to do at compile time for integer programs 
that contain pointers. In a hardware-based scheme, dynamic run time disam- 
biguation of memory addresses is done using the techniques we saw earlier 
for Tomasulo's algorithm. This disambiguation allows us to move loads past 
stores at run time. Support for speculative memory references can help over- 
come the conservatism of the compiler, but unless such approaches are used 
carefully, the overhead of the recovery mechanisms may swamp the advan- 
tages. 


• Hardware-based speculation works better when control flow is unpredictable, 
and when hardware-based branch prediction is superior to software-based 
branch prediction done at compile time. These properties hold for many inte- 
ger programs. For example, a good static predictor has a misprediction rate of 
about 16% for four major integer SPEC92 programs, and a hardware predic- 
tor has a misprediction rate of under 10%. Because speculated instructions 
may slow down the computation when the prediction is incorrect, this differ- 
ence is significant. One result of this difference is that even statically sched- 
uled processors normally include dynamic branch predictors. 


• Hardware-based speculation maintains a completely precise exception model 
even for speculated instructions. Recent software-based approaches have 
added special support to allow this as well. 


• Hardware-based speculation does not require compensation or bookkeeping 
code, which is needed by ambitious software speculation mechanisms. 


• Compiler-based approaches may benefit from the ability to see further in the 
code sequence, resulting in better code scheduling than a purely hardware- 
driven approach. 


• Hardware-based speculation with dynamic scheduling does not require dif- 
ferent code sequences to achieve good performance for different implementa- 
tions of an architecture. Although this advantage is the hardest to quantify, it 
may be the most important in the long run. Interestingly, this was one of the 
motivations for the IBM 360/91. On the other hand, more recent explicitly 
parallel architectures, such as IA-64, have added flexibility that reduces the 
hardware dependence inherent in a code sequence. 


The major disadvantage of supporting speculation in hardware is the com- 
plexity and additional hardware resources required. This hardware cost must be 
evaluated against both the complexity of a compiler for a software-based 
approach and the amount and usefulness of the simplifications in a processor that 
relies on such a compiler. We return to this topic in the concluding remarks. 

Some designers have tried to combine the dynamic and compiler-based 
approaches to achieve the best of each. Such a combination can generate interest- 
ing and obscure interactions. For example, if conditional moves are combined 
with register renaming, a subtle side effect appears. A conditional move that is 
annulled must still copy a value to the destination register, since it was renamed 
earlier in the instruction pipeline. These subtle interactions complicate the design 
and verification process and can also reduce performance. 
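
The conditional-move side effect can be seen in a toy renaming model. This is a sketch only: the class, its free-list policy (physical registers are never reclaimed), and the register names are assumptions made for illustration, not a description of any real pipeline.

```python
class RenamedRegisterFile:
    """Toy register renaming: every write allocates a fresh physical register."""

    def __init__(self, arch_regs, num_physical: int):
        self.phys = [0] * num_physical
        self.map = {r: i for i, r in enumerate(arch_regs)}   # architectural -> physical
        self.free = list(range(len(arch_regs), num_physical))

    def read(self, reg: str) -> int:
        return self.phys[self.map[reg]]

    def write(self, reg: str, value: int) -> None:
        p = self.free.pop(0)          # destination is renamed before the value is known
        self.map[reg] = p
        self.phys[p] = value

    def cmov(self, rd: str, rs: str, cond: bool) -> None:
        # The destination is renamed regardless of cond, so an annulled move
        # must still copy the OLD value of rd into the new physical register.
        old_value = self.read(rd)
        self.write(rd, self.read(rs) if cond else old_value)


rf = RenamedRegisterFile(["R1", "R2"], num_physical=8)
rf.write("R1", 10)
rf.write("R2", 20)
rf.cmov("R1", "R2", cond=False)   # annulled: R1 keeps 10, but in a new physical register
print(rf.read("R1"))
rf.cmov("R1", "R2", cond=True)
print(rf.read("R1"))
```

In effect the conditional move gains an extra source operand (the old destination value), which is the kind of hidden cost the paragraph above describes.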


3.5 Multithreading: Using ILP Support to Exploit Thread-Level Parallelism 


Although increasing performance by using ILP has the great advantage that it is 
reasonably transparent to the programmer, as we have seen, ILP can be quite lim- 
ited or hard to exploit in some applications. Furthermore, there may be significant 
parallelism occurring naturally at a higher level in the application that cannot be 
exploited with the approaches discussed in this chapter. For example, an online 
transaction-processing system has natural parallelism among the multiple queries 
and updates that are presented by requests. These queries and updates can be pro- 
cessed mostly in parallel, since they are largely independent of one another. Of 
course, many scientific applications contain natural parallelism since they model 
the three-dimensional, parallel structure of nature, and that structure can be 
exploited in a simulation. 

This higher-level parallelism is called thread-level parallelism (TLP) because 
it is logically structured as separate threads of execution. A thread is a separate 
process with its own instructions and data. A thread may represent a process that 
is part of a parallel program consisting of multiple processes, or it may represent 
an independent program on its own. Each thread has all the state (instructions, 
data, PC, register state, and so on) necessary to allow it to execute. Unlike 
instruction-level parallelism, which exploits implicit parallel operations within a 
loop or straight-line code segment, thread-level parallelism is explicitly repre- 
sented by the use of multiple threads of execution that are inherently parallel. 

Thread-level parallelism is an important alternative to instruction-level paral- 
lelism primarily because it could be more cost-effective to exploit than 
instruction-level parallelism. There are many important applications where 
thread-level parallelism occurs naturally, as it does in many server applications. 
In other cases, the software is being written from scratch, and expressing the 
inherent parallelism is easy, as is true in some embedded applications. Large, 
established applications written without parallelism in mind, however, pose a sig- 
nificant challenge and can be extremely costly to rewrite to exploit thread-level 
parallelism. Chapter 4 explores multiprocessors and the support they provide for 
thread-level parallelism. 

Thread-level and instruction-level parallelism exploit two different kinds of 
parallel structure in a program. One natural question to ask is whether it is possi- 
ble for a processor oriented at instruction-level parallelism to exploit thread-level 
parallelism. The motivation for this question comes from the observation that a 
data path designed to exploit higher amounts of ILP will find that functional units 
are often idle because of either stalls or dependences in the code. Could the paral- 
lelism among threads be used as a source of independent instructions that might 
keep the processor busy during stalls? Could this thread-level parallelism be used 
to employ the functional units that would otherwise lie idle when insufficient 
ILP exists? 

Multithreading allows multiple threads to share the functional units of a single 
processor in an overlapping fashion. To permit this sharing, the processor must 
duplicate the independent state of each thread. For example, a separate copy of 
the register file, a separate PC, and a separate page table are required for each 
thread. The memory itself can be shared through the virtual memory mecha- 
nisms, which already support multiprogramming. In addition, the hardware must 
support the ability to change to a different thread relatively quickly; in particular, 
a thread switch should be much more efficient than a process switch, which typi- 
cally requires hundreds to thousands of processor cycles. 
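
The per-thread state described above can be pictured as a small record that the hardware replicates once per thread. The following is only an illustrative sketch: the names `ThreadContext` and `MultithreadedCore`, the 32-entry register file, and the use of a base address to stand in for the page table are assumptions made for the example, not details of any processor discussed here.

```python
from dataclasses import dataclass, field
from typing import List

NUM_ARCH_REGS = 32  # assumed register-file size, for illustration only


@dataclass
class ThreadContext:
    """State that must be duplicated for each hardware thread."""
    pc: int = 0                      # separate program counter
    page_table_base: int = 0         # stands in for the separate page table
    regs: List[int] = field(default_factory=lambda: [0] * NUM_ARCH_REGS)


@dataclass
class MultithreadedCore:
    """Functional units are shared; only the contexts are replicated."""
    contexts: List[ThreadContext]
    active: int = 0                  # index of the thread currently selected

    def switch_to(self, thread_id: int) -> ThreadContext:
        # A hardware thread switch only changes which context is selected;
        # nothing is saved to or restored from memory.
        self.active = thread_id
        return self.contexts[thread_id]


core = MultithreadedCore(contexts=[ThreadContext(pc=0x1000), ThreadContext(pc=0x2000)])
core.contexts[0].regs[1] = 42        # thread 0 writes its own r1
ctx = core.switch_to(1)
print(hex(ctx.pc), ctx.regs[1])      # thread 1 has its own PC and an untouched r1
```

Because every context is resident at once, selecting another thread is a matter of changing an index, which is why such a switch can be far cheaper than a process switch performed by the operating system.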

There are two main approaches to multithreading. Fine-grained multithread- 
ing switches between threads on each instruction, causing the execution of multi- 
ple threads to be interleaved. This interleaving is often done in a round-robin 
fashion, skipping any threads that are stalled at that time. To make fine-grained 
multithreading practical, the CPU must be able to switch threads on every clock 
cycle. One key advantage of fine-grained multithreading is that it can hide the 
throughput losses that arise from both short and long stalls, since instructions 
from other threads can be executed when one thread stalls. The primary disad- 
vantage of fine-grained multithreading is that it slows down the execution of the 
individual threads, since a thread that is ready to execute without stalls will be de- 
layed by instructions from other threads. 

Coarse-grained multithreading was invented as an alternative to fine-grained 
multithreading. Coarse-grained multithreading switches threads only on costly 
stalls, such as level 2 cache misses. This change relieves the need to have thread- 
switching be essentially free and is much less likely to slow the processor down, 
since instructions from other threads will only be issued when a thread encoun- 
ters a costly stall. 

Coarse-grained multithreading suffers, however, from a major drawback: It is 
limited in its ability to overcome throughput losses, especially from shorter stalls. 
This limitation arises from the pipeline start-up costs of coarse-grained
multithreading. Because a CPU with coarse-grained multithreading issues instructions from
a single thread, when a stall occurs, the pipeline must be emptied or frozen. The 
new thread that begins executing after the stall must fill the pipeline before in- 
structions will be able to complete. Because of this start-up overhead, coarse- 
grained multithreading is much more useful for reducing the penalty of high-cost 
stalls, where pipeline refill is negligible compared to the stall time. 
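
The contrast between the two policies can be made concrete with a toy cycle-count model. The sketch below assumes a single-issue pipeline, represents each thread as a list of stall lengths (one entry per instruction), and charges a fixed `refill` penalty on every coarse-grained switch; the function names, the threshold of 10 cycles, and the refill cost of 3 cycles are all hypothetical choices for illustration.

```python
from typing import List

Thread = List[int]  # Thread[i] = stall cycles suffered after instruction i issues


def fine_grained_cycles(threads: List[Thread]) -> int:
    """Cycles until the last instruction issues, switching threads every cycle."""
    n = len(threads)
    next_instr = [0] * n               # next instruction index per thread
    ready_at = [0] * n                 # first cycle each thread may issue again
    remaining = sum(len(t) for t in threads)
    cycle, last = 0, n - 1
    while remaining:
        # Round-robin from the thread after the last issuer, skipping
        # threads that are stalled or finished.
        for k in range(1, n + 1):
            t = (last + k) % n
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][next_instr[t]]
                next_instr[t] += 1
                remaining -= 1
                last = t
                break
        cycle += 1                     # an idle cycle if nothing issued
    return cycle


def coarse_grained_cycles(threads: List[Thread], threshold: int, refill: int) -> int:
    """Same measure, but switch only on stalls of at least `threshold` cycles.

    Each switch costs `refill` cycles to model the pipeline start-up period.
    """
    n = len(threads)
    next_instr = [0] * n
    ready_at = [0] * n
    last_stall = [0] * n               # length of each thread's most recent stall
    remaining = sum(len(t) for t in threads)
    cycle, cur = 0, 0
    while remaining:
        has_work = next_instr[cur] < len(threads[cur])
        if has_work and ready_at[cur] <= cycle:
            last_stall[cur] = threads[cur][next_instr[cur]]
            ready_at[cur] = cycle + 1 + last_stall[cur]
            next_instr[cur] += 1
            remaining -= 1
            cycle += 1
            continue
        # The current thread cannot issue: it is finished or stalled.
        cheap_stall = has_work and last_stall[cur] < threshold
        runnable = [t for t in ((cur + k) % n for k in range(1, n))
                    if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle]
        if runnable and not cheap_stall:
            cur = runnable[0]
            cycle += refill            # the new thread must refill the pipeline
        else:
            cycle += 1                 # wait out the stall with the pipeline idle
    return cycle


short_stalls = [[2, 2, 2]] * 3         # three threads, a 2-cycle stall after every instruction
long_stalls = [[0, 50, 0]] * 2         # two threads, one 50-cycle stall each

print(fine_grained_cycles(short_stalls), coarse_grained_cycles(short_stalls, 10, 3))
print(fine_grained_cycles(long_stalls), coarse_grained_cycles(long_stalls, 10, 3))
```

With these inputs the fine-grained model issues the nine short-stall instructions in 9 cycles, whereas the coarse-grained model needs 27 because it waits out every 2-cycle stall and pays two refills. For the 50-cycle stalls, both policies finish well ahead of the 106 cycles that running the two threads back to back would take, since the refill cost is small next to the stall.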

The next subsection explores a variation on fine-grained multithreading that 
enables a superscalar processor to exploit ILP and multithreading in an integrated 
and efficient fashion. In Chapter 4, we return to the issue of multithreading when 
we discuss its integration with multiple CPUs in a single chip. 


Simultaneous Multithreading: Converting Thread-Level 
Parallelism into Instruction-Level Parallelism 


Simultaneous multithreading (SMT) is a variation on multithreading that uses the 
resources of a multiple-issue, dynamically scheduled processor to exploit TLP at 
the same time it exploits ILP. The key insight that motivates SMT is that modern 
multiple-issue processors often have more functional unit parallelism available 
than a single thread can effectively use. Furthermore, with register renaming and 
dynamic scheduling, multiple instructions from independent threads can be is- 
sued without regard to the dependences among them; the resolution of the depen- 
dences can be handled by the dynamic scheduling capability. 

Figure 3.8 conceptually illustrates the differences in a processor's ability to 
exploit the resources of a superscalar for the following processor configurations: 


• A superscalar with no multithreading support
• A superscalar with coarse-grained multithreading
• A superscalar with fine-grained multithreading
• A superscalar with simultaneous multithreading


In the superscalar without multithreading support, the use of issue slots is 
limited by a lack of ILP, a topic we discussed in earlier sections. In addition, a 
major stall, such as an instruction cache miss, can leave the entire processor idle. 

In the coarse-grained multithreaded superscalar, the long stalls are partially 
hidden by switching to another thread that uses the resources of the processor. 
Although this reduces the number of completely idle clock cycles, within each 
clock cycle, the ILP limitations still lead to idle cycles. Furthermore, in a coarse- 
grained multithreaded processor, since thread switching only occurs when there 
is a stall and the new thread has a start-up period, there are likely to be some fully 
idle cycles remaining. 


[Figure 3.8 diagram: issue slots (horizontal) against clock cycles (vertical), with
one panel each for Superscalar, Coarse MT, Fine MT, and SMT.]

Figure 3.8 How four different approaches use the issue slots of a superscalar processor.
The horizontal dimension represents the instruction issue capability in each clock cycle.
The vertical dimension represents a sequence of clock cycles. An empty (white) box
indicates that the corresponding issue slot is unused in that clock cycle. The shades of
grey and black correspond to four different threads in the multithreading processors.
Black is also used to indicate the occupied issue slots in the case of the superscalar
without multithreading support. The Sun T1 (aka Niagara) processor, which is discussed in
the next chapter, is a fine-grained multithreaded architecture.

In the fine-grained case, the interleaving of threads eliminates fully empty 
slots. Because only one thread issues instructions in a given clock cycle, however, 
ILP limitations still lead to a significant number of idle slots within individual 
clock cycles. 

In the SMT case, TLP and ILP are exploited simultaneously, with multiple 
threads using the issue slots in a single clock cycle. Ideally, the issue slot usage is 
limited by imbalances in the resource needs and resource availability over multi- 
ple threads. In practice, other factors—including how many active threads are 
considered, finite limitations on buffers, the ability to fetch enough instructions 
from multiple threads, and practical limitations of what instruction combinations 
can issue from one thread and from multiple threads—can also restrict how many 
slots are used. Although Figure 3.8 greatly simplifies the real operation of these 
processors, it does illustrate the potential performance advantages of multithread- 
ing in general and SMT in particular. 
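
A rough way to see the slot-filling argument in numbers is to count the issue slots each policy fills under a simplified model. The sketch below is an assumption-laden toy: `ready[t][c]` is a made-up count of independent instructions that thread `t` could issue in cycle `c`, unissued instructions do not carry over to later cycles, and the coarse-grained case is left out because it needs the start-up modeling discussed earlier. The function name `slots_used` and the data are hypothetical.

```python
from typing import List


def slots_used(ready: List[List[int]], policy: str, width: int) -> int:
    """Count issue slots filled over len(ready[0]) cycles.

    ready[t][c] is how many independent instructions thread t could issue
    in cycle c (0 models a stall).
    """
    n_threads, n_cycles = len(ready), len(ready[0])
    used = 0
    for c in range(n_cycles):
        if policy == "superscalar":        # only thread 0 ever issues
            used += min(ready[0][c], width)
        elif policy == "fine":             # one thread per cycle, round-robin,
            order = [(c + k) % n_threads for k in range(n_threads)]
            pick = next((t for t in order if ready[t][c] > 0), None)
            if pick is not None:           # skipping threads that are stalled
                used += min(ready[pick][c], width)
        elif policy == "smt":              # all threads share the slots of a cycle
            used += min(sum(r[c] for r in ready), width)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return used


# Hypothetical per-cycle ILP for four threads over eight cycles.
ready = [
    [2, 0, 1, 3, 0, 2, 1, 0],
    [1, 2, 0, 1, 2, 0, 3, 1],
    [0, 1, 2, 0, 1, 2, 0, 2],
    [3, 0, 1, 2, 0, 1, 2, 0],
]
WIDTH = 4
for policy in ("superscalar", "fine", "smt"):
    print(policy, slots_used(ready, policy, WIDTH), "of", WIDTH * len(ready[0]))
```

On this invented workload the single-threaded superscalar fills 9 of 32 slots, fine-grained multithreading fills 15, and SMT fills 29. The ordering, not the particular numbers, is the point: fine-grained interleaving removes empty cycles, but only sharing slots within a cycle removes the partially empty ones.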

As mentioned earlier, simultaneous multithreading uses the insight that a dy- 
namically scheduled processor already has many of the hardware mechanisms 
needed to support the integrated exploitation of TLP through multithreading. In 
particular, dynamically scheduled superscalars have a large set of virtual registers 
that can be used to hold the register sets of independent threads (assuming sepa- 
rate renaming tables are kept for each thread). Because register renaming pro- 
vides unique register identifiers, instructions from multiple threads can be mixed 
in the data path without confusing sources and destinations across the threads. 

This observation leads to the insight that multithreading can be built on top 
of an out-of-order processor by adding a per-thread renaming table, keeping 
separate PCs, and providing the capability for instructions from multiple 
threads to commit. 

There are complications in handling instruction commit, since we would like 
instructions from independent threads to be able to commit independently. The 
independent commitment of instructions from separate threads can be supported 
by logically keeping a separate reorder buffer for each thread. 
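
The role of the per-thread renaming table can be sketched in a few lines. The class below is a minimal illustration, not a description of any real renamer: the name `SmtRenamer`, the simple free list, and the omission of register reclamation at commit are all simplifying assumptions.

```python
from typing import Dict, List, Tuple


class SmtRenamer:
    """Shared pool of physical registers with one rename table per thread."""

    def __init__(self, n_threads: int, n_arch: int, n_phys: int) -> None:
        if n_phys < n_threads * n_arch:
            raise ValueError("need a physical register for every architectural one")
        self.free: List[int] = list(range(n_phys))
        # Give every architectural register of every thread an initial mapping.
        self.table: List[Dict[int, int]] = [
            {r: self.free.pop(0) for r in range(n_arch)} for _ in range(n_threads)
        ]

    def rename(self, thread: int, dest: int,
               srcs: Tuple[int, ...]) -> Tuple[int, Tuple[int, ...]]:
        """Return (physical destination, physical sources) for one instruction."""
        table = self.table[thread]
        phys_srcs = tuple(table[s] for s in srcs)  # read sources before remapping dest
        phys_dest = self.free.pop(0)               # IndexError if the pool is exhausted
        table[dest] = phys_dest
        return phys_dest, phys_srcs


renamer = SmtRenamer(n_threads=2, n_arch=4, n_phys=12)
a = renamer.rename(0, dest=1, srcs=(2, 3))   # thread 0: r1 <- r2 op r3
b = renamer.rename(1, dest=1, srcs=(2, 3))   # thread 1: the same architectural names
c = renamer.rename(0, dest=2, srcs=(1, 1))   # thread 0 reads its own new r1
print(a, b, c)
```

The same architectural instruction from two threads maps to disjoint physical registers, so once renamed, the instructions can share the data path and the scheduler without any thread tag on the operands.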


Design Challenges in SMT 


Because a dynamically scheduled superscalar processor is likely to have a deep 
pipeline, SMT will be unlikely to gain much in performance if it were coarse- 
grained. Since SMT makes sense only in a fine-grained implementation, we must 
worry about the impact of fine-grained scheduling on single-thread performance. 
This effect can be minimized by having a preferred thread, which still permits 
multithreading to preserve some of its performance advantage with a smaller 
compromise in single-thread performance. 

At first glance, it might appear that a preferred-thread approach sacrifices nei- 
ther throughput nor single-thread performance. Unfortunately, with a preferred 
thread, the processor is likely to sacrifice some throughput when the preferred 
thread encounters a stall. The reason is that the pipeline is less likely to have a 
mix of instructions from several threads, resulting in greater probability that 
either empty slots or a stall will occur. Throughput is maximized by having a suf- 
ficient number of independent threads to hide all stalls in any combination of 
threads. 

Unfortunately, mixing many threads will inevitably compromise the execution 
time of individual threads. Similar problems exist in instruction fetch. To maxi- 
mize single-thread performance, we should fetch as far ahead as possible in that 
single thread and always have the fetch unit free when a branch is mispredicted 
and a miss occurs in the prefetch buffer. Unfortunately, this limits the number of 
instructions available for scheduling from other threads, reducing throughput. All 
multithreaded processors must seek to balance this trade-off. 

In practice, the problems of dividing resources and balancing single-thread 
and multiple-thread performance turn out not to be as challenging as they sound, 
at least for current superscalar back ends. In particular, for current machines that 
issue four to eight instructions per cycle, it probably suffices to have a small num- 
ber of active threads, and an even smaller number of "preferred" threads. When- 
ever possible, the processor acts on behalf of a preferred thread. This starts with 
prefetching instructions: whenever the prefetch buffers for the preferred threads 
are not full, instructions are fetched for those threads. Only when the preferred 
thread buffers are full is the instruction unit directed to prefetch for other threads. 
Note that having two preferred threads means that we are simultaneously 
prefetching for two instruction streams, and this adds complexity to the instruc- 
tion fetch unit and the instruction cache. Similarly, the instruction issue unit can 
direct its attention to the preferred threads, considering other threads only if the 
preferred threads are stalled and cannot issue. 
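
The fetch side of this preferred-thread policy can be written down as a small selection function. This is a sketch under stated assumptions: the name `pick_fetch_thread`, the dictionary of buffer occupancies, and the tie-break among non-preferred threads (emptiest buffer first) are hypothetical and are not taken from any particular design.

```python
from typing import Dict, List, Optional


def pick_fetch_thread(buffer_fill: Dict[int, int], capacity: int,
                      preferred: List[int]) -> Optional[int]:
    """Choose which thread the fetch unit serves this cycle.

    buffer_fill maps a thread id to the number of entries currently held in
    that thread's prefetch buffer; capacity is the size of each buffer.
    """
    # Preferred threads are served whenever their prefetch buffer has room.
    for t in preferred:
        if buffer_fill[t] < capacity:
            return t
    # Only when every preferred buffer is full do other threads get a turn.
    # Picking the emptiest buffer is an assumed tie-break, not from the text.
    others = [t for t in buffer_fill
              if t not in preferred and buffer_fill[t] < capacity]
    return min(others, key=lambda t: buffer_fill[t]) if others else None


print(pick_fetch_thread({0: 3, 1: 0, 2: 5}, capacity=8, preferred=[0]))
print(pick_fetch_thread({0: 8, 1: 6, 2: 2}, capacity=8, preferred=[0]))
print(pick_fetch_thread({0: 8, 1: 8, 2: 8}, capacity=8, preferred=[0]))
```

The issue stage could apply the same pattern, considering non-preferred threads only when the preferred ones are stalled.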

There are a variety of other design challenges for an SMT processor, includ- 
ing the following: 


• Dealing with a larger register file needed to hold multiple contexts

• Not affecting the clock cycle, particularly in critical steps such as instruction
issue, where more candidate instructions need to be considered, and in
instruction completion, where choosing what instructions to commit may be
challenging

• Ensuring that the cache and TLB conflicts generated by the simultaneous
execution of multiple threads do not cause significant performance degradation


In viewing these problems, two observations are important. First, in many cases, 
the potential performance overhead due to multithreading is small, and simple 
choices work well enough. Second, the efficiency of current superscalars is low 
enough that there is room for significant improvement, even at the cost of some 
overhead. 

The IBM Power5 used the same pipeline as the Power4, but it added SMT 
support. In adding SMT, the designers found that they had to increase a num- 
ber of structures in the processor so as to minimize the negative performance 
consequences from fine-grained thread interaction. These changes included 
the following: 

• Increasing the associativity of the L1 instruction cache and the instruction
address translation buffers

• Adding per-thread load and store queues

• Increasing the size of the L2 and L3 caches

• Adding separate instruction prefetch and buffering

• Increasing the number of virtual registers from 152 to 240

• Increasing the size of several issue queues


Because SMT exploits thread-level parallelism on a multiple-issue supersca- 
lar, it is most likely to be included in high-end processors targeted at server mar- 
kets. In addition, it is likely that there will be some mode to restrict the 
multithreading, so as to maximize the performance of a single thread. 


Potential Performance Advantages from SMT 


A key question is, How much performance can be gained by implementing SMT? 
When this question was explored in 2000-2001, researchers assumed that dy- 
namic superscalars would get much wider in the next five years, supporting six to 
eight issues per clock with speculative dynamic scheduling, many simultaneous 
loads and stores, large primary caches, and four to eight contexts with simulta- 
neous fetching from multiple contexts. For a variety of reasons, which will be- 
come more clear in the next section, no processor of this capability has been built 
nor is likely to be built in the near future. 

As a result, simulation research results that showed gains for multipro- 
grammed workloads of two or more times are unrealistic. In practice, the existing 
implementations of SMT offer only two contexts with fetching from only one, as 
well as more modest issue abilities. The result is that the gain from SMT is also 
more modest. 

For example, in the Pentium 4 Extreme, as implemented in HP-Compaq servers,
the use of SMT yields a performance improvement of 1.01 when running the
SPECintRate benchmark and about 1.07 when running the SPECfpRate benchmark.
In a separate study, Tuck and Tullsen [2003] observe that running a mix of
each of the 26 SPEC benchmarks paired with every other SPEC benchmark (that
is, 26² runs, if a benchmark is also run opposite itself) results in speedups ranging
from 0.90 to 1.58, with an average speedup of 1.20. (Note that this measurement 
is different from SPECRate, which requires that each SPEC benchmark be run 
against a vendor-selected number of copies of the same benchmark.) On the 
SPLASH parallel benchmarks, they report multithreaded speedups ranging from 
1.02 to 1.67, with an average speedup of about 1.22. 

The IBM Power5 is the most aggressive implementation of SMT as of 2005 
and has extensive additions to support SMT, as described in the previous subsec- 
tion. A direct performance comparison of the Power5 in SMT mode, running two 
copies of an application on a processor, versus the Power5 in single-thread mode, 
with one process per core, shows speedup across a wide variety of benchmarks of 
between 0.89 (a performance loss) and 1.41. Most applications, however, showed 
at least some gain from SMT; floating-point-intensive applications, which suf- 
fered the most cache conflicts, showed the least gains. 

Figure 3.9 shows the speedup for an 8-processor Power5 multiprocessor with 
and without SMT for the SPECRate2000 benchmarks, as described in the cap- 
tion. On average, the SPECintRate is 1.23 times faster, while the SPECfpRate is 
1.16 times faster. Note that a few floating-point benchmarks experience a slight 
decrease in performance in SMT mode, with the maximum reduction in speedup 
being 0.93. 

Figure 3.9 A comparison of SMT and single-thread (ST) performance on the 8-processor IBM eServer p5 575.
Note that the y-axis starts at a speedup of 0.9, a performance loss. Only one processor in each Power5 core is active,
which should slightly improve the results from SMT by decreasing destructive interference in the memory system.
The SMT results are obtained by creating 16 user threads, while the ST results use only 8 threads; with only one
thread per processor, the Power5 is switched to single-threaded mode by the OS. These results were collected by
John McCalpin of IBM. As we can see from the data, the standard deviation of the results for the SPECfpRate is higher
than for SPECintRate (0.13 versus 0.07), indicating that the SMT improvement for FP programs is likely to vary widely.

These results clearly show the benefit of SMT for an aggressive speculative 
processor with extensive support for SMT. Because of the costs and diminishing 
returns in performance, however, rather than implement wider superscalars and 
more aggressive versions of SMT, many designers are opting to implement multi- 
ple CPU cores on a single die with slightly less aggressive support for multiple 
issue and multithreading; we return to this topic in the next chapter. 

3.6 Putting It All Together: Performance and Efficiency
in Advanced Multiple-Issue Processors

In this section, we discuss the characteristics of several recent multiple-issue pro- 
cessors and examine their performance and their efficiency in use of silicon, tran- 
sistors, and energy. We then turn to a discussion of the practical limits of 
superscalars and the future of high-performance microprocessors. 

Figure 3.10 shows the characteristics of four of the most recent high- 
performance microprocessors. They vary widely in organization, issue rate, func- 
tional unit capability, clock rate, die size, transistor count, and power. As Figures 
3.11 and 3.12 show, there is no obvious overall leader in performance. The Ita- 
nium 2 and Power5, which perform similarly on SPECfp, clearly dominate the 
Athlon and Pentium 4 on those benchmarks. The AMD Athlon leads on SPECint 
performance followed by the Pentium 4, Itanium 2, and Power5. 

                                                    Fetch/             Clock
                                                    issue/    Func.    rate    Transistors
Processor            Microarchitecture              execute   units    (GHz)   and die size   Power
Intel Pentium 4      Speculative dynamically        3/3/4     7 int.   3.8     125M           115 W
Extreme              scheduled; deeply pipelined;             1 FP             122 mm²
                     SMT
AMD Athlon 64        Speculative dynamically        3/3/4     6 int.   2.8     114M           104 W
FX-57                scheduled                                3 FP             115 mm²
IBM Power5           Speculative dynamically        8/4/8     6 int.   1.9     200M           80 W
1 processor          scheduled; SMT; two CPU                  2 FP             300 mm²        (estimated)
                     cores/chip                                                (estimated)
Intel Itanium 2      EPIC style; primarily          6/5/11    9 int.   1.6     592M           130 W
                     statically scheduled                     2 FP             423 mm²





Figure 3.10 The characteristics of four recent multiple-issue processors. The Power5 includes two CPU cores,
although we only look at the performance of one core in this chapter. The transistor count, area, and power
consumption of the Power5 are estimated for one core based on two-core measurements of 276M, 389 mm², and 125 W,
respectively. The large die and transistor count for the Itanium 2 is partly driven by a 9 MB tertiary cache on the chip.
The AMD Opteron and Athlon both share the same core microarchitecture. Athlon is intended for desktops and does
not support multiprocessing; Opteron is intended for servers and does. This is similar to the differentiation between
Pentium and Xeon in the Intel product line.







[Figure 3.11 bar chart: SPECRatio (horizontal axis, 0 to 3500) per SPECint2000 benchmark, with bars for the
Itanium 2, Pentium 4 @ 3.8 GHz, AMD Athlon 64, and Power5. Benchmark labels recovered from the chart: gzip, vpr,
gcc, mcf, crafty, eon, perlbmk, gap, vortex, bzip2.]

Figure 3.11 A comparison of the performance of the four advanced multiple-issue
processors shown in Figure 3.10 for the SPECint2000 benchmarks.


[Figure 3.12 bar chart: SPECRatio (horizontal axis, 0 to 14,000) per SPECfp2000 benchmark, with bars for the
Itanium 2, Pentium 4 @ 3.8 GHz, AMD Athlon 64, and Power5. Benchmark labels recovered from the chart: wupwise,
swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

Figure 3.12 A comparison of the performance of the four advanced multiple-issue
processors shown in Figure 3.10 for the SPECfp2000 benchmarks.

Figure 3.13 Efficiency measures for four multiple-issue processors. In the case of
Power5, a single die includes two processor cores, so we estimate the single-core
metrics as power = 80 W, area = 290 mm², and transistor count = 200M.


As important as overall performance is, the question of efficiency in terms of 
silicon area and power is equally critical. As we discussed in Chapter 1, power 
has become the major constraint on modern processors. Figure 3.13 shows how 
these processors compare in terms of efficiency, by charting the SPECint and 
SPECfp performance versus the transistor count, silicon area, and power. The 
results provide an interesting contrast to the performance results. The Itanium 2 
is the most inefficient processor both for floating-point and integer code for all 
but one measure (SPECfp/watt). The Athlon and Pentium 4 both make good use 
of transistors and area in terms of efficiency, while the IBM Power5 is the most 
effective user of energy on SPECfp and essentially tied on SPECint. The fact that 
none of the processors offer an overwhelming advantage in efficiency across mul- 
tiple measures leads us to believe that none of these approaches provide a "silver 
bullet" that will allow the exploitation of ILP to scale easily and efficiently much 
beyond current levels. 

Let's try to understand why this is the case. 


What Limits Multiple-Issue Processors? 


The limitations explored in Sections 3.1 and 3.3 act as significant barriers to 
exploiting more ILP, but they are not the only barriers. For example, doubling the 
issue rates above the current rates of 3-6 instructions per clock, say, to 6-12 
instructions, will probably require a processor to issue three or four data memory 
accesses per cycle, resolve two or three branches per cycle, rename and access 
more than 20 registers per cycle, and fetch 12-24 instructions per cycle. The 
complexities of implementing these capabilities are likely to mean sacrifices in the 
maximum clock rate. For example, the widest-issue processor in Figure 3.10 is 
the Itanium 2, but it also has the slowest clock rate, despite the fact that it con- 
sumes the most power! 
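
The observation about issue width and clock rate can be checked directly against Figure 3.10. The short sketch below multiplies the issue width (the middle number of each fetch/issue/execute triple) by the clock rate to obtain a peak issue rate, and divides by power. Peak issue rate is only an upper bound, not delivered performance, and the helper name `peak_issue_rate` is introduced here purely for illustration.

```python
# (issue width, clock rate in GHz, power in W), transcribed from Figure 3.10.
# The Power5 power figure is the single-core estimate given there.
PROCESSORS = {
    "Pentium 4 Extreme": (3, 3.8, 115),
    "Athlon 64 FX-57": (3, 2.8, 104),
    "Power5 (one core)": (4, 1.9, 80),
    "Itanium 2": (5, 1.6, 130),
}


def peak_issue_rate(width: int, ghz: float) -> float:
    """Peak issue rate in billions of instructions per second."""
    return width * ghz


for name, (width, ghz, watts) in PROCESSORS.items():
    peak = peak_issue_rate(width, ghz)
    per_watt = peak / watts * 1000   # millions of instructions per second per watt
    print(f"{name:18s} peak {peak:5.1f} G instr/s   {per_watt:5.1f} M instr/s per W")
```

On these figures the Itanium 2's wider issue does not give it the highest peak issue rate, because the narrower but much faster-clocked Pentium 4 and Athlon both come out ahead.
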

It is now widely accepted that modern microprocessors are primarily power 
limited. Power is a function of both static power, which grows proportionally to 
the transistor count (whether or not the transistors are switching), and dynamic 
power, which is proportional to the product of the number of switching transis- 
tors and the switching rate. Although static power is certainly a design concern, 
when operating, dynamic power is usually the dominant energy consumer. A 
microprocessor trying to achieve both a low CPI and a high clock rate (CR) must switch more 
transistors and switch them faster, increasing the power consumption as the prod- 
uct of the two. 
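
For reference, the two power components can be written with the formulas listed at the front of the book; the typesetting below is ours, and the terms are exactly those of formulas 3 and 4 there.

```latex
\begin{align*}
\text{Power}_{\text{dynamic}} &= \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency switched} \\
\text{Power}_{\text{static}}  &= \text{Current}_{\text{static}} \times \text{Voltage}
\end{align*}
```

Switching more transistors raises the effective capacitive load, and switching them faster raises the frequency term, so the two effects multiply in the dynamic component.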

Of course, most techniques for increasing performance, including multiple 
cores and multithreading, will increase power consumption. The key question is 
whether a technique is energy efficient: Does it increase power consumption 
faster than it increases performance? Unfortunately, the techniques we currently 
have to boost the performance of multiple-issue processors all have this ineffi- 
ciency, which arises from two primary characteristics. 

First, issuing multiple instructions incurs some overhead in logic that grows 
faster than the issue rate grows. This logic is responsible for instruction issue 
analysis, including dependence checking, register renaming, and similar func- 
tions. The combined result is that, without voltage reductions to decrease power, 
lower CPIs are likely to lead to lower ratios of performance per watt, simply due 
to overhead. 

Second, and more important, is the growing gap between peak issue rates and 
sustained performance. Since the number of transistors switching will be propor- 
tional to the peak issue rate, and the performance is proportional to the sustained 
rate, a growing performance gap between peak and sustained performance trans- 
lates to increasing energy per unit of performance. Unfortunately, this growing 
gap appears to be quite fundamental and arises from many of the issues we dis- 
cuss in Sections 3.2 and 3.3. For example, if we want to sustain four instructions 
per clock, we must fetch more, issue more, and initiate execution on more than 
four instructions. The power will be proportional to the peak rate, but perfor- 
mance will be at the sustained rate. (In many recent processors, provision has 
been made for decreasing power consumption by shutting down an inactive por- 
tion of a processor, including powering off the clock to that portion of the chip. 
Such techniques, while useful, cannot prevent the long-term decrease in power 
efficiency.) 
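
The argument about peak versus sustained rates reduces to a simple ratio. The sketch below takes the text's two proportionality assumptions at face value (switching power tracks the peak issue rate, performance tracks the sustained rate); the function name and the example IPC values are hypothetical.

```python
def relative_energy_per_instruction(peak_ipc: float, sustained_ipc: float) -> float:
    """Energy per completed instruction, relative to sustaining the peak rate.

    Assumes power is proportional to the peak issue rate and performance is
    proportional to the sustained rate, so energy per instruction scales as
    peak / sustained.
    """
    if not 0 < sustained_ipc <= peak_ipc:
        raise ValueError("need 0 < sustained_ipc <= peak_ipc")
    return peak_ipc / sustained_ipc


# Hypothetical machines: widening the peak faster than the sustained rate
# raises the energy spent per useful instruction.
print(relative_energy_per_instruction(4, 2.0))   # 2.0
print(relative_energy_per_instruction(8, 2.5))   # 3.2
```

In this model, doubling the peak width while the sustained rate rises only from 2.0 to 2.5 increases the energy per useful instruction by 60%.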

Furthermore, the most important technique of the last decade for increasing 
the exploitation of ILP—namely, speculation—is inherently inefficient. Why? 
Because it can never be perfect; that is, there is inherently waste in executing 
computations before we know whether they advance the program. 

If speculation were perfect, it could save power, since it would reduce the 
execution time and save static power, while adding some additional overhead to 
implement. When speculation is not perfect, it rapidly becomes energy ineffi- 
cient, since it requires additional dynamic power both for the incorrect specula- 
tion and for the resetting of the processor state. Because of the overhead of 
implementing speculation—register renaming, reorder buffers, more registers, 
and so on—it is unlikely that any speculative processor could save energy for a 
significant range of realistic programs. 
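
The trade-off between static power saved and dynamic power wasted can be expressed as a toy energy ratio. Everything in the sketch below is an assumption made for illustration: the function `energy_ratio`, its four parameters, and the example values are not measurements, and the cost of resetting processor state is folded into the `wasted` term.

```python
def energy_ratio(static_fraction: float, speedup: float,
                 overhead: float, wasted: float) -> float:
    """Energy with speculation divided by energy without it (toy model).

    static_fraction: share of the baseline energy that is static
    speedup:         execution-time speedup obtained from speculation
    overhead:        extra dynamic energy per instruction (renaming, reorder buffer, ...)
    wasted:          misspeculated instructions executed per useful instruction
    """
    static = static_fraction / speedup                       # shorter run, less static energy
    dynamic = (1.0 - static_fraction) * (1.0 + overhead) * (1.0 + wasted)
    return static + dynamic


perfect = energy_ratio(static_fraction=0.3, speedup=1.5, overhead=0.1, wasted=0.0)
imperfect = energy_ratio(static_fraction=0.3, speedup=1.5, overhead=0.1, wasted=0.3)
print(round(perfect, 3), round(imperfect, 3))
```

With these made-up parameters, perfect speculation comes out slightly below 1 (a small saving), while 30% wasted work pushes the ratio to about 1.2. The model shows how quickly misspeculation erases whatever static energy the shorter execution time saves.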

What about focusing on improving clock rate? Unfortunately, a similar 
conundrum applies to attempts to increase clock rate: increasing the clock rate 
will increase transistor switching frequency and directly increase power con- 
sumption. To achieve a faster clock rate, we would need to increase pipeline 
depth. Deeper pipelines, however, incur additional overhead penalties as well as 
causing higher switching rates. 

The best example of this phenomenon comes from comparing the Pentium III
and Pentium 4. To a first approximation, the Pentium 4 is a deeply pipelined
version of the Pentium III architecture. Built in a similar process technology,
its power consumption rises roughly in proportion to its increase in clock rate.
Unfortunately, its performance gain is somewhat less than the ratio of the clock
rates because of overhead and ILP limitations.

It appears that we have reached—and, in some cases, possibly even sur- 
passed—the point of diminishing returns in our attempts to exploit ILP. The 
implications of these limits can be seen over the last few years in the slower per- 
formance growth rates (see Chapter 1), in the lack of increase in issue capability, 
and in the emergence of multicore designs; we return to this issue in the conclud- 
ing remarks. 


3.7 Fallacies and Pitfalls


Fallacy There is a simple approach to multiple-issue processors that yields high
performance without a significant investment in silicon area or design complexity.


The last few sections should have made this point obvious. What has been sur- 
prising is that many designers have believed that this fallacy was accurate and 
committed significant effort to trying to find this "silver bullet" approach. 
Although it is possible to build relatively simple multiple-issue processors, as 
issue rates increase, diminishing returns appear and the silicon and energy costs 
of wider issue dominate the performance gains. 

In addition to the hardware inefficiency, it has become clear that compiling 
for processors with significant amounts of ILP has become extremely complex. 
Not only must the compiler support a wide set of sophisticated transformations, 
but tuning the compiler to achieve good performance across a wide set of bench- 
marks appears to be very difficult. 

Obtaining good performance is also affected by design decisions at the sys- 
tem level, and such choices can be complex, as the last section clearly illustrated. 


Pitfall Improving only one aspect of a multiple-issue processor and expecting overall
performance improvement.

This pitfall is simply a restatement of Amdahl's Law. A designer might simply 
look at a design, see a poor branch-prediction mechanism, and improve it, 
expecting to see significant performance improvements. The difficulty is that 
many factors limit the performance of multiple-issue machines, and improving 
one aspect of a processor often exposes some other aspect that previously did not 
limit performance. 

We can see examples of this in the data on ILP. For example, looking just at 
the effect of branch prediction in Figure 3.3 on page 160, we can see that going 
from a standard 2-bit predictor to a tournament predictor significantly improves 
the parallelism in espresso (from an issue rate of 7 to an issue rate of 12). If the 
processor provides only 32 registers for renaming, however, the amount of paral- 
lelism is limited to 5 issues per clock cycle, even with a branch-prediction 
scheme better than either alternative. 


Concluding Remarks 


The relative merits of software-intensive and hardware-intensive approaches to 
exploiting ILP continue to be debated, although the debate has shifted in the 
last five years. Initially, the software-intensive and hardware-intensive 
approaches were quite different, and the ability to manage the complexity of 
the hardware-intensive approaches was in doubt. The development of several 
high-performance dynamic speculation processors, which have high clock 
rates, has eased this concern. 

The complexity of the IA-64 architecture and the Itanium design has signaled 
to many designers that it is unlikely that a software-intensive approach will pro- 
duce processors that are significantly faster (especially for integer code), smaller 
(in transistor count or die size), simpler, or more power efficient. It has become 
clear in the past five years that the IA-64 architecture does not represent a signifi- 
cant breakthrough in scaling ILP or in avoiding the problems of complexity and 
power consumption in high-performance processors. Appendix H explores this 
assessment in more detail. 

The limits of complexity and diminishing returns for wider issue probably 
also mean that only limited use of simultaneous multithreading is likely. It sim- 
ply is not worthwhile to build the very wide issue processors that would justify 
the most aggressive implementations of SMT. For this reason, existing designs 
have used modest, two-context versions of SMT or simple multithreading with 
two contexts, which is the appropriate choice with simple one- or two-issue 
processors. 

Instead of pursuing more ILP, architects are increasingly focusing on TLP 
implemented with single-chip multiprocessors, which we explore in the next 
chapter. In 2000, IBM announced the first commercial single-chip, general-pur- 
pose multiprocessor, the Power4, which contains two Power3 processors and an 
integrated second-level cache. Since then, Sun Microsystems, AMD, and Intel 
have switched to a focus on single-chip multiprocessors rather than more aggressive
uniprocessors.

The question of the right balance of ILP and TLP is still open in 2005, and 
designers are exploring the full range of options, from simple pipelining with 
more processors per chip, to aggressive ILP and SMT with fewer processors. It 
may well be that the right choice for the server market, which can exploit more 
TLP, may differ from the desktop, where single-thread performance may con- 
tinue to be a primary requirement. We return to this topic in the next chapter. 


Historical Perspective and References 


Section K.4 on the companion CD features a discussion on the development of 
pipelining and instruction-level parallelism. We provide numerous references for 
further reading and exploration of these topics. 


Case Study with Exercises by Wen-mei W. Hwu and 
John W. Sias 


Concepts illustrated by this case study 


• Limited ILP due to software dependences
• Achievable ILP with hardware resource constraints
• Variability of ILP due to software and hardware interaction
• Tradeoffs in ILP techniques at compile time vs. execution time


Case Study: Dependences and Instruction-Level Parallelism 


The purpose of this case study is to demonstrate the interaction of hardware and 
software factors in producing instruction-level parallel execution. This case study 
presents a concise code example that concretely illustrates the various limits on 
instruction-level parallelism. By working with this case study, you will gain intu- 
ition about how hardware and software factors interact to determine the execution 
time of a particular type of code on a given system. 

A hash table is a popular data structure for organizing a large collection of 
data items so that one can quickly answer questions such as, "Does an element of 
value 100 exist in the collection?" This is done by assigning data elements into 
one of a large number of buckets according to a hash function value generated 
from the data values. The data items in each bucket are typically organized as a 
linked list sorted according to a given order. A lookup of the hash table starts by 
determining the bucket that corresponds to the data value in question. It then 
traverses the linked list of data elements in the bucket and checks if any element 
in the list has the value in question. As long as one keeps the number of data elements
in each bucket small, the search result can be determined very quickly.
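
The lookup just described can be sketched as follows. This is a minimal illustration under stated assumptions: the names Node, N_BUCKETS, and hash_lookup are hypothetical, the hash is the low-order-bits function used later in this case study, and each bucket is assumed to be sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>

#define N_BUCKETS 1024   /* assumed bucket count, matching the case study */

typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* Return 1 if value is present in the table, 0 otherwise.
   Each bucket is assumed to be a linked list sorted in ascending order,
   so the scan can stop at the first element that is not smaller. */
int hash_lookup(Node *const *table, int value)
{
    assert(value >= 0);   /* negative values are outside this sketch */
    const Node *p = table[value & (N_BUCKETS - 1)];
    while (p != NULL && p->value < value)
        p = p->next;
    return p != NULL && p->value == value;
}
```

Because the chain is sorted, the loop visits at most the elements smaller than the value sought, so short chains keep the lookup fast.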

The C source code in Figure 3.14 inserts a large number (N_ELEMENTS) of 
elements into a hash table, whose 1024 buckets are all linked lists sorted in 
ascending order according to the value of the elements. The array element[] 
contains the elements to be inserted, allocated on the heap. Each iteration of the 
outer (for) loop, starting at line 6, enters one element into the hash table. 

Line 9 in Figure 3.14 calculates hash_index, the hash function value, from 
the data value stored in element[i]. The hashing function used is a very simple


 1  typedef struct _Element {
 2      int value;
 3      struct _Element *next;
 4  } Element;
 5  Element element[N_ELEMENTS], *bucket[1024];
    /* The array element is initialized with the items to be inserted;
       the pointers in the array bucket are initialized to NULL. */

 6  for (i = 0; i < N_ELEMENTS; i++)
    {
 7      Element *ptrCurr, **ptrUpdate;
 8      int hash_index;

        /* Find the location at which the new element is to be inserted. */
 9      hash_index = element[i].value & 1023;
10      ptrUpdate = &bucket[hash_index];
11      ptrCurr = bucket[hash_index];

        /* Find the place in the chain to insert the new element. */
12      while (ptrCurr &&
13             ptrCurr->value <= element[i].value)
14      {
15          ptrUpdate = &ptrCurr->next;
16          ptrCurr = ptrCurr->next;
        }

        /* Update pointers to insert the new element into the chain. */
17      element[i].next = *ptrUpdate;
18      *ptrUpdate = &element[i];
    }


Figure 3.14 Hash table code example. 
one; it consists of the least significant 10 bits of an element's data value. This is
done by computing the bitwise logical AND of the element data value and the
(binary) bit mask 1111111111 (1023 in decimal).
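
The masking step can be written as a small function, shown here as a sketch. The name hash_index_of is a hypothetical helper, and the restriction to nonnegative values is an assumption made to keep the example simple.

```c
#include <assert.h>

/* Hash used on line 9 of Figure 3.14: keep the least significant 10 bits.
   For nonnegative values this equals value % 1024.
   The function name is a hypothetical helper, not from the figure. */
int hash_index_of(int value)
{
    assert(value >= 0);    /* negative values are outside this sketch */
    return value & 1023;   /* 1023 is binary 1111111111 */
}
```

Under this hash, values that differ by a multiple of 1024 (for example 1, 1025, and 2049) all fall into the same bucket.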

Figure 3.15 illustrates the hash table data structure used in our C code exam- 
ple. The bucket array on the left side of Figure 3.15 is the hash table. Each entry 
of the bucket array contains a pointer to the linked list that stores the data elements
in the bucket. If bucket i is currently empty, the corresponding bucket[i]
entry contains a NULL pointer. In Figure 3.15, the first three buckets contain one
data element each; the other buckets are empty. 

Variable ptrCurr contains a pointer used to examine the elements in the 
linked list of a bucket. At Line 11 of Figure 3.14, ptrCurr is set to point to the 
first element of the linked list stored in the given bucket of the hash table. If the 
bucket selected by the hash_index is empty, the corresponding bucket array 
entry contains a NULL pointer.

The while loop starts at line 12. Line 12 tests whether there are any more data elements
to be examined by checking the contents of variable ptrCurr. Lines 13
through 16 will be skipped if there are no more elements to be examined, either
because the bucket is empty, or because all the data elements in the linked list
have been examined by previous iterations of the while loop. In the first case, the
new data element will be inserted as the first element in the bucket. In the second
case, the new element will be inserted as the last element of the linked list.

In the case where there are still more elements to be examined, line 13 tests if 
the current linked list element contains a value that is smaller than or equal to that 
of the data element to be inserted into the hash table. If the condition is true, the 
while loop will continue to move on to the next element in the linked list; lines 
15 and 16 advance to the next data element of the linked list by moving ptrCurr 
to the next element in the linked list. Otherwise, it has found the position in the 


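[Figure 3.15 diagram: the bucket array on the left, whose first three entries point to element[0], element[1], and element[2]; each element has value and next fields.]

Figure 3.15 Hash table data structure.
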
linked list where the new data element should be inserted; the while loop will
terminate and the new data element will be inserted right before the element
pointed to by ptrCurr.

The variable ptrUpdate identifies the pointer that must be updated in order to
insert the new data element into the bucket. It is set by line 10 to point to the
bucket entry. If the bucket is empty, the while loop will be skipped altogether
and the new data element is inserted by changing the pointer in
bucket[hash_index] from NULL to the address of the new data element by line
18. After the while loop, ptrUpdate points to the pointer that must be updated
for the new element to be inserted into the appropriate bucket.

After the execution exits the while loop, lines 17 and 18 finish the job of
inserting the new data element into the linked list. In the case where the bucket is
empty, ptrUpdate will point to bucket[hash_index], which contains a NULL
pointer. Line 17 will then assign that NULL pointer to the next pointer of the new
data element. Line 18 changes bucket[hash_index] to point to the new data
element. In the case where the new data element is smaller than all elements in
the linked list, ptrUpdate will also point to bucket[hash_index], which points
to the first element of the linked list. In this case, line 17 assigns the pointer to the
first element of the linked list to the next pointer of the new data element.

In the case where the new data element is greater than some of the linked list
elements but smaller than the others, ptrUpdate will point to the next pointer of
the element after which the new data element will be inserted. In this case, line 17
makes the new data element point to the element right after the insertion point.
Line 18 then updates the element right before the insertion point so that it points
to the new data element. The reader should verify that the code works correctly
when the new data element is to be inserted at the end of the linked list.
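
One way to carry out that verification is to compile the loop and inspect the resulting chains. The sketch below wraps the logic of Figure 3.14 in a function. The names insert_all, chain_is_sorted, and N_BUCKETS are assumptions introduced for illustration, and the struct tag is spelled Element rather than _Element to avoid a reserved identifier.

```c
#include <assert.h>
#include <stddef.h>

#define N_BUCKETS 1024

typedef struct Element {
    int value;
    struct Element *next;
} Element;

/* Insert element[0..n-1] into bucket[], keeping each chain sorted in
   ascending order. Same logic as lines 6-18 of Figure 3.14. */
void insert_all(Element *element, size_t n, Element **bucket)
{
    for (size_t i = 0; i < n; i++) {
        assert(element[i].value >= 0);   /* sketch handles nonnegative values */
        int hash_index = element[i].value & (N_BUCKETS - 1);
        Element **ptrUpdate = &bucket[hash_index];
        Element *ptrCurr = bucket[hash_index];

        while (ptrCurr && ptrCurr->value <= element[i].value) {
            ptrUpdate = &ptrCurr->next;
            ptrCurr = ptrCurr->next;
        }
        element[i].next = *ptrUpdate;   /* new element points at its successor */
        *ptrUpdate = &element[i];       /* predecessor or bucket points at it */
    }
}

/* Return 1 if the chain starting at p is in ascending order, 0 otherwise. */
int chain_is_sorted(const Element *p)
{
    for (; p && p->next; p = p->next)
        if (p->value > p->next->value)
            return 0;
    return 1;
}
```

Inserting the values 2048, 0, and 1024 in that order exercises three of the cases discussed above: an empty bucket, insertion before the first element, and insertion between two elements. All three hash to bucket 0, and the chain ends up ordered 0, 1024, 2048.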

Now that we have a good understanding of the C code, we will proceed with 
analyzing the amount of instruction-level parallelism available in this piece of 
code. 


[25/15/10/15/20/20/15] <2.1, 2.2, 3.2, 3.3, App. H> This part of our case study 
will focus on the amount of instruction-level parallelism available to the run time 
hardware scheduler under the most favorable execution scenarios (the ideal 
case). (Later, we will consider less ideal scenarios for the run time hardware 
scheduler as well as the amount of parallelism available to a compiler scheduler.) 
For the ideal scenario, assume that the hash table is initially empty. Suppose there 
are 1024 new data elements, whose values are numbered sequentially from 0 to 
1023, so that each goes in its own bucket (this reduces the problem to a matter of 
updating known array locations!). Figure 3.15 shows the hash table contents after 
the first three elements have been inserted, according to this "ideal case." Since 
the value of element[i] is simply i in this ideal case, each element is
inserted into its own bucket. 


For the purposes of this case study, assume that each line of code in Figure 3.14 
takes one execution cycle (its dependence height is 1) and, for the purposes of 
computing ILP, takes one instruction. These (unrealistic) assumptions are made 
to greatly simplify bookkeeping in solving the following exercises. Note that the
for and while statements execute on each iteration of their respective loops, to
test if the loop should continue. In this ideal case, most of the dependences in the 
code sequence are relaxed and a high degree of ILP is therefore readily available. 
We will later examine a general case, in which the realistic dependences in the 
code segment reduce the amount of parallelism available. 


Further suppose that the code is executed on an "ideal" processor with infinite 
issue width, unlimited renaming, "omniscient" knowledge of memory access dis- 
ambiguation, branch prediction, and so on, so that the execution of instructions is 
limited only by data dependence. Consider the following in this context: 


a. [25] <2.1> Describe the data (true, anti, and output) and control dependences
that govern the parallelism of this code segment, as seen by a run time hard- 
ware scheduler. Indicate only the actual dependences (i.e., ignore depen- 
dences between stores and loads that access different addresses, even if a 
compiler or processor would not realistically determine this). Draw the 
dynamic dependence graph for six consecutive iterations of the outer loop 
(for insertion of six elements), under the ideal case. Note that in this dynamic 
dependence graph, we are identifying data dependences between dynamic 
instances of instructions: each static instruction in the original program has 
multiple dynamic instances due to loop execution. Hint: The following defi- 
nitions may help you find the dependences related to each instruction: 


• Data true dependence: On the results of which previous instructions does
  each instruction immediately depend?
• Data antidependence: Which instructions subsequently write locations
  read by the instruction?
• Data output dependence: Which instructions subsequently write locations
  written by the instruction?
• Control dependence: On what previous decisions does the execution of a
  particular instruction depend (in what case will it be reached)?


b. [15] <2.1> Assuming the ideal case just described, and using the dynamic
dependence graph you just constructed, how many instructions are executed, 
and in how many cycles? 


c. [10] <3.2> What is the average level of ILP available during the execution of
the for loop? 


d. [15] <2.2, App. H> In part (c) we considered the maximum parallelism
achievable by a run-time hardware scheduler using the code as written. How
could a compiler increase the available parallelism, assuming that the compiler
knows that it is dealing with the ideal case? Hint: Think about what is
the primary constraint that prevents executing more iterations at once in the 
ideal case. How can the loop be restructured to relax that constraint? 
e. [25] <3.2, 3.3> For simplicity, assume that only variables i, hash_index,
ptrCurr, and ptrUpdate need to occupy registers. Assuming general renam- 
ing, how many registers are necessary to achieve the maximum achievable 
parallelism in part (b)? 


f. [25] <3.3> Assume that in your answer to part (a) there are 7 instructions in 
each iteration. Now, assuming a consistent steady-state schedule of the 
instructions in the example and an issue rate of 3 instructions per cycle, how 
is execution time affected? 


g. [15] <3.3> Finally, calculate the minimal instruction window size needed to 
achieve the maximal level of parallelism. 


[15/15/15/10/10/15/15/10/10/10/25] <2.1, 3.2, 3.3> Let us now consider less 
favorable scenarios for extraction of instruction-level parallelism by a run-time 
hardware scheduler in the hash table code in Figure 3.14 (the general case). Sup- 
pose that there is no longer a guarantee that each bucket will receive exactly one 
item. Let us reevaluate our assessment of the parallelism available, given the 
more realistic situation, which adds some additional, important dependences. 


Recall that in the ideal case, the relatively serial inner loop was not in play, and 
the outer loop provided ample parallelism. In general, the inner loop is in play: 
the inner while loop could iterate one or more times. Keep in mind that the inner
loop, the while loop, has only a limited amount of instruction-level parallelism.
First of all, each iteration of the while loop depends on the result of the previous 
iteration. Second, within each iteration, only a small number of instructions are 
executed. 


The outer loop is, on the contrary, quite parallel. As long as two elements of the 
outer loop are hashed into different buckets, they can be entered in parallel. Even 
when they are hashed to the same bucket, they can still go in parallel as long as 
some type of memory disambiguation enforces correctness of memory loads and 
stores performed on behalf of each element. 


Note that in reality, the data element values will likely be randomly distributed. 
Although we aim to provide the reader insight into more realistic execution sce- 
narios, we will begin with some regular but nonideal data value patterns that are 
amenable to systematic analysis. These value patterns offer some intermediate 
steps toward understanding the amount of instruction-level parallelism under the 
most general, random data values. 


a. [15] <2.1> Draw a dynamic dependence graph for the hash table code in 
Figure 3.14 when the values of the 1024 data elements to be inserted are 0, 
1, 1024, 1025, 2048, 2049, 3072, 3073,.... Describe the new dependences 
across iterations for the for loop when the while loop is iterated one or 
more times. Pay special attention to the fact that the inner while loop now
can iterate one or more times. The number of instructions in the outer for 
loop will therefore likely vary as it iterates. For the purpose of determining 
dependences between loads and stores, assume a dynamic memory disam- 
biguation that cannot resolve the dependences between two memory 
accesses based on different base pointer registers. For example, the run time
hardware cannot disambiguate between a store based on ptrUpdate and a 
load based on ptrCurr. 
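
The value pattern in part (a) can be generated programmatically, which makes it easier to see which buckets are involved. The following is a sketch; pattern_value and pattern_bucket are hypothetical helper names, and the index k counts insertions starting from 0.

```c
#include <assert.h>
#include <stddef.h>

/* k-th value of the sequence 0, 1, 1024, 1025, 2048, 2049, 3072, 3073, ...
   The function names here are hypothetical helpers. */
int pattern_value(size_t k)
{
    assert(k < 1024);   /* the exercise inserts 1024 elements */
    return (int)(k / 2) * 1024 + (int)(k % 2);
}

/* Bucket selected for that value by the hash on line 9 of Figure 3.14. */
int pattern_bucket(size_t k)
{
    return pattern_value(k) & 1023;
}
```

Every even-numbered insertion lands in bucket 0 and every odd-numbered insertion in bucket 1, so only two chains are ever used. Because the values arrive in ascending order, each insertion must walk past the elements already in its chain.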


b. [15] <2.1> Assuming the dynamic dependence graph you derived in part (a),
how many instructions will be executed? 


c. [15] <2.1> Assuming the dynamic dependence graph you derived in part (a)
and an unlimited amount of hardware resources, how many clock cycles will 
it take to execute all the instructions you calculated in part (b)? 


d. [10] <2.1> How much instruction-level parallelism is available in the
dynamic dependence graph you derived in part (a)? 

e. [10] <2.1, 3.2> Using the same assumption about the run time memory disambiguation
mechanism as in part (a), identify a sequence of data elements that will
cause the worst-case scenario of the way these new dependences affect the 
level of parallelism available. 


f. [15] <2.1, 3.2> Now, assuming the worst-case sequence used in part (e), explain
the potential effect of a perfect run time memory disambiguation mechanism 
(i.e., a system that tracks all outstanding stores and allows all nonconflicting 
loads to proceed). Derive the number of clock cycles required to execute all 
the instructions in the dynamic dependence graph. 


On the basis of what you have learned so far, consider a couple of qualitative 
questions: What is the effect of allowing loads to issue speculatively, before 
prior store addresses are known? How does such speculation affect the signif- 
icance of memory latency in this code? 


g. [15] <2.1, 3.2> Continue the same assumptions as in part (f), and calculate
the number of instructions executed. 


h. [10] <2.1, 3.2> Continue the same assumptions as in part (f), and calculate
the amount of instruction-level parallelism available to the run-time hard- 
ware. 


i. [10] <2.1, 3.2> In part (h), what is the effect of limited instruction window
sizes on the level of instruction-level parallelism? 


j. [10] <3.2, 3.3> Now, continuing to consider your solution to part (h),
describe the cause of branch-prediction misses and the effect of each branch 
prediction on the level of parallelism available. Reflect briefly on the implica- 
tions for power and efficiency. What are potential costs and benefits to exe- 
cuting many off-path speculative instructions (i.e., initiating execution of 
instructions that will later be squashed by branch-misprediction detection)? 
Hint: Think about the effect on the execution of subsequent insertions of 
mispredicting the number of elements before the insertion point. 


k. [25] <3> Consider the concept of a static dependence graph that captures all
the worst-case dependences for the purpose of constraining compiler schedul- 
ing and optimization. Draw the static dependence graph for the hash table 
code shown in Figure 3.14. 
Compare the static dependence graph with the various dynamic dependence
graphs drawn previously. Reflect in a paragraph or two on the implications of 
this comparison for dynamic and static discovery of instruction-level parallel- 
ism in this example's hash table code. In particular, how is the compiler con- 
strained by having to consistently take into consideration the worst case, 
where a hardware mechanism might be free to take advantage opportunisti- 
cally of fortuitous cases? What sort of approaches might help the compiler to 
make better use of this code? 
Multiprocessors and
Thread-Level Parallelism 


The turning away from the conventional organization came in the 
middle 1960s, when the law of diminishing returns began to take 
effect in the effort to increase the operational speed of a computer.... 
Electronic circuits are ultimately limited in their speed of operation by 
the speed of light...and many of the circuits were already operating in
the nanosecond range. 


W. Jack Bouknight et al.
The Illiac IV System (1972)


We are dedicating all of our future product development to multicore 
designs. We believe this is a key inflection point for the industry. 
Intel President Paul Otellini,
describing Intel's future direction at the
Intel Developers Forum in 2005
4.1 Introduction


As the quotation that opens this chapter shows, the view that advances in uni- 
processor architecture were nearing an end has been held by some researchers for 
many years. Clearly these views were premature; in fact, during the period of 
1986-2002, uniprocessor performance growth, driven by the microprocessor, 
was at its highest rate since the first transistorized computers in the late 1950s 
and early 1960s. 

Nonetheless, the importance of multiprocessors was growing throughout the 
1990s as designers sought a way to build servers and supercomputers that 
achieved higher performance than a single microprocessor, while exploiting the 
tremendous cost-performance advantages of commodity microprocessors. As 
we discussed in Chapters 1 and 3, the slowdown in uniprocessor performance 
arising from diminishing returns in exploiting ILP, combined with growing con- 
cern over power, is leading to a new era in computer architecture—an era where 
multiprocessors play a major role. The second quotation captures this clear 
inflection point. 

This trend toward more reliance on multiprocessing is reinforced by other 
factors: 


• A growing interest in servers and server performance
• A growth in data-intensive applications
• The insight that increasing performance on the desktop is less important
  (outside of graphics, at least)
• An improved understanding of how to use multiprocessors effectively,
  especially in server environments where there is significant natural
  thread-level parallelism
• The advantages of leveraging a design investment by replication rather than
  unique design; all multiprocessor designs provide such leverage


That said, we are left with two problems. First, multiprocessor architecture is 
a large and diverse field, and much of the field is in its youth, with ideas coming 
and going and, until very recently, more architectures failing than succeeding. 
Full coverage of the multiprocessor design space and its trade-offs would require 
another volume. (Indeed, Culler, Singh, and Gupta [1999] cover only multipro- 
cessors in their 1000-page book!) Second, broad coverage would necessarily 
entail discussing approaches that may not stand the test of time—something we 
have largely avoided to this point. 

For these reasons, we have chosen to focus on the mainstream of multiproces- 
sor design: multiprocessors with small to medium numbers of processors (4 to 
32). Such designs vastly dominate in terms of both units and dollars. We will pay 
only slight attention to the larger-scale multiprocessor design space (32 or more 
processors), primarily in Appendix H, which covers more aspects of the design of 
such processors, as well as the performance behavior for parallel scientific
workloads, a primary class of applications for large-scale multiprocessors. In the
large-scale multiprocessors, the interconnection networks are a critical part of the 
design; Appendix E focuses on that topic. 


A Taxonomy of Parallel Architectures 


We begin this chapter with a taxonomy so that you can appreciate both the 
breadth of design alternatives for multiprocessors and the context that has led to 
the development of the dominant form of multiprocessors. We briefly describe 
the alternatives and the rationale behind them; a longer description of how these 
different models were born (and often died) can be found in Appendix K. 

The idea of using multiple processors both to increase performance and to 
improve availability dates back to the earliest electronic computers. About 40 
years ago, Flynn [1966] proposed a simple model of categorizing all computers 
that is still useful today. He looked at the parallelism in the instruction and data 
streams called for by the instructions at the most constrained component of the 
multiprocessor, and placed all computers into one of four categories: 


1. Single instruction stream, single data stream (SISD)—This category is the 
uniprocessor. 


2. Single instruction stream, multiple data streams (SIMD)—The same instruc- 
tion is executed by multiple processors using different data streams. SIMD 
computers exploit data-level parallelism by applying the same operations to 
multiple items of data in parallel. Each processor has its own data memory 
(hence multiple data), but there is a single instruction memory and control 
processor, which fetches and dispatches instructions. For applications that 
display significant data-level parallelism, the SIMD approach can be very 
efficient. The multimedia extensions discussed in Appendices B and C are a 
form of SIMD parallelism. Vector architectures, discussed in Appendix F, are 
the largest class of SIMD architectures. SIMD approaches have experienced a 
rebirth in the last few years with the growing importance of graphics perfor- 
mance, especially for the game market. SIMD approaches are the favored 
method for achieving the high performance needed to create realistic three- 
dimensional, real-time virtual environments. 


3. Multiple instruction streams, single data stream (MISD)—No commercial 
multiprocessor of this type has been built to date. 


4. Multiple instruction streams, multiple data streams (MIMD)—Each proces- 
sor fetches its own instructions and operates on its own data. MIMD comput- 
ers exploit thread-level parallelism, since multiple threads operate in parallel. 
In general, thread-level parallelism is more flexible than data-level parallel- 
ism and thus more generally applicable. 


This is a coarse model, as some multiprocessors are hybrids of these categories. 
Nonetheless, it is useful to put a framework on the design space. 




Because the MIMD model can exploit thread-level parallelism, it is the archi- 
tecture of choice for general-purpose multiprocessors and our focus in this chap- 
ter. Two other factors have also contributed to the rise of the MIMD 
multiprocessors: 


1. MIMDs offer flexibility. With the correct hardware and software support, 
MIMDs can function as single-user multiprocessors focusing on high perfor- 
mance for one application, as multiprogrammed multiprocessors running 
many tasks simultaneously, or as some combination of these functions. 


2. MIMDs can build on the cost-performance advantages of off-the-shelf pro- 
cessors. In fact, nearly all multiprocessors built today use the same micropro- 
cessors found in workstations and single-processor servers. Furthermore, 
multicore chips leverage the design investment in a single processor core by 
replicating it. 


One popular class of MIMD computers is clusters, which often use standard
components and often standard network technology, so as to leverage as
much commodity technology as possible. In Appendix H we distinguish two 
different types of clusters: commodity clusters, which rely entirely on third- 
party processors and interconnection technology, and custom clusters, in which 
a designer customizes either the detailed node design or the interconnection 
network, or both. 

In a commodity cluster, the nodes of a cluster are often blades or rack- 
mounted servers (including small-scale multiprocessor servers). Applications that 
focus on throughput and require almost no communication among threads, such 
as Web serving, multiprogramming, and some transaction-processing applica- 
tions, can be accommodated inexpensively on a cluster. Commodity clusters are 
often assembled by users or computer center directors, rather than by vendors. 

Custom clusters are typically focused on parallel applications that can 
exploit large amounts of parallelism on a single problem. Such applications 
require a significant amount of communication during the computation, and 
customizing the node and interconnect design makes such communication 
more efficient than in a commodity cluster. Currently, the largest and fastest 
multiprocessors in existence are custom clusters, such as the IBM Blue Gene, 
which we discuss in Appendix H. 

Starting in the 1990s, the increasing capacity of a single chip allowed designers
to place multiple processors on a single die. This approach, initially called on-chip
multiprocessing or single-chip multiprocessing, has come to be called multicore,
a name arising from the use of multiple processor cores on a single die. In
such a design, the multiple cores typically share some resources, such as a
second- or third-level cache or memory and I/O buses. Recent processors, including
the IBM Power5, the Sun T1, and the Intel Pentium D and Xeon-MP, are multicore
and multithreaded. Just as using multiple copies of a microprocessor in a
multiprocessor leverages a design investment through replication, a multicore 
achieves the same advantage relying more on replication than the alternative of 
building a wider superscalar. 
With an MIMD, each processor is executing its own instruction stream. In
many cases, each processor executes a different process. A process is a segment 
of code that may be run independently; the state of the process contains all the 
information necessary to execute that program on a processor. In a multipro- 
grammed environment, where the processors may be running independent tasks, 
each process is typically independent of other processes. 

It is also useful to be able to have multiple processors executing a single pro- 
gram and sharing the code and most of their address space. When multiple pro- 
cesses share code and data in this way, they are often called threads. Today, the 
term thread is often used in a casual way to refer to multiple loci of execution that 
may run on different processors, even when they do not share an address space. 
For example, a multithreaded architecture actually allows the simultaneous exe- 
cution of multiple processes, with potentially separate address spaces, as well as 
multiple threads that share the same address space. 

To take advantage of an MIMD multiprocessor with n processors, we must 
usually have at least n threads or processes to execute. The independent threads 
within a single process are typically identified by the programmer or created by 
the compiler. The threads may come from large-scale, independent processes 
scheduled and manipulated by the operating system. At the other extreme, a 
thread may consist of a few tens of iterations of a loop, generated by a parallel 
compiler exploiting data parallelism in the loop. Although the amount of compu- 
tation assigned to a thread, called the grain size, is important in considering how 
to exploit thread-level parallelism efficiently, the important qualitative distinction 
from instruction-level parallelism is that thread-level parallelism is identified at a 
high level by the software system and that the threads consist of hundreds to mil- 
lions of instructions that may be executed in parallel. 

Threads can also be used to exploit data-level parallelism, although the over- 
head is likely to be higher than would be seen in an SIMD computer. This over- 
head means that grain size must be sufficiently large to exploit the parallelism 
efficiently. For example, although a vector processor (see Appendix F) may be 
able to efficiently parallelize operations on short vectors, the resulting grain size 
when the parallelism is split among many threads may be so small that the over- 
head makes the exploitation of the parallelism prohibitively expensive. 

Existing MIMD multiprocessors fall into two classes, depending on the num- 
ber of processors involved, which in turn dictates a memory organization and 
interconnect strategy. We refer to the multiprocessors by their memory organiza- 
tion because what constitutes a small or large number of processors is likely to 
change over time. 

The first group, which we call centralized shared-memory architectures, has 
at most a few dozen processor chips (and fewer than 100 cores) in 2006. For multi- 
processors with small processor counts, it is possible for the processors to share a 
single centralized memory. With large caches, a single memory, possibly with 
multiple banks, can satisfy the memory demands of a small number of proces- 
sors. By using multiple point-to-point connections, or a switch, and adding addi- 
tional memory banks, a centralized shared-memory design can be scaled to a few 
dozen processors. Although scaling beyond that is technically conceivable, 
sharing a centralized memory becomes less attractive as the number of proces- 
sors sharing it increases. 

Because there is a single main memory that has a symmetric relationship to 
all processors and a uniform access time from any processor, these multiproces- 
sors are most often called symmetric (shared-memory) multiprocessors (SMPs), 
and this style of architecture is sometimes called uniform memory access (UMA), 
arising from the fact that all processors have a uniform latency from memory, 
even if the memory is organized into multiple banks. Figure 4.1 shows what these 
multiprocessors look like. This type of symmetric shared-memory architecture is 
currently by far the most popular organization. The architecture of such multipro- 
cessors is the topic of Section 4.2. 

The second group consists of multiprocessors with physically distributed 
memory. Figure 4.2 shows what these multiprocessors look like. To support 
larger processor counts, memory must be distributed among the processors 
rather than centralized; otherwise the memory system would not be able to sup- 
port the bandwidth demands of a larger number of processors without incurring 
excessively long access latency. With the rapid increase in processor perfor- 
mance and the associated increase in a processor's memory bandwidth require- 
ments, the size of a multiprocessor for which distributed memory is preferred 
continues to shrink. The larger number of processors also raises the need for a 
high-bandwidth interconnect, of which we will see examples in Appendix E. 

Figure 4.1 Basic structure of a centralized shared-memory multiprocessor. Multiple 
processor-cache subsystems share the same physical memory, typically connected by 
one or more buses or a switch. The key architectural property is the uniform access time 
to all of memory from all the processors. 

Figure 4.2 The basic architecture of a distributed-memory multiprocessor consists 
of individual nodes containing a processor, some memory, typically some I/O, and 
an interface to an interconnection network that connects all the nodes. Individual 
nodes may contain a small number of processors, which may be interconnected by a 
small bus or a different interconnection technology, which is less scalable than the 
global interconnection network. 


Both direct networks (i.e., switches) and indirect networks (typically multi- 
dimensional meshes) are used. 

Distributing the memory among the nodes has two major benefits. First, it is a 
cost-effective way to scale the memory bandwidth if most of the accesses are to 
the local memory in the node. Second, it reduces the latency for accesses to the 
local memory. These two advantages make distributed memory attractive at 
smaller processor counts as processors get ever faster and require more memory 
bandwidth and lower memory latency. The key disadvantages for a distributed- 
memory architecture are that communicating data between processors becomes 
somewhat more complex, and that it requires more effort in the software to take 
advantage of the increased memory bandwidth afforded by distributed memories. 
As we will see shortly, the use of distributed memory also leads to two different 
paradigms for interprocessor communication. 


Models for Communication and Memory Architecture 


As discussed earlier, any large-scale multiprocessor must use multiple memories 
that are physically distributed with the processors. There are two alternative 
architectural approaches that differ in the method used for communicating data 
among processors. 

In the first method, communication occurs through a shared address space, as 
it does in a symmetric shared-memory architecture. The physically separate 
memories can be addressed as one logically shared address space, meaning that a 
memory reference can be made by any processor to any memory location, assum- 
ing it has the correct access rights. These multiprocessors are called distributed 
shared-memory (DSM) architectures. The term shared memory refers to the fact 
that the address space is shared; that is, the same physical address on two proces- 
sors refers to the same location in memory. Shared memory does not mean that 
there is a single, centralized memory. In contrast to the symmetric shared-mem- 
ory multiprocessors, also known as UMAs (uniform memory access), the DSM 
multiprocessors are also called NUMAs (nonuniform memory access), since the 
access time depends on the location of a data word in memory. 

Alternatively, the address space can consist of multiple private address spaces 
that are logically disjoint and cannot be addressed by a remote processor. In such 
multiprocessors, the same physical address on two different processors refers to 
two different locations in two different memories. Each processor-memory mod- 
ule is essentially a separate computer. Initially, such computers were built with 
different processing nodes and specialized interconnection networks. Today, 
most designs of this type are actually clusters, which we discuss in Appendix H. 

With each of these organizations for the address space, there is an associated 
communication mechanism. For a multiprocessor with a shared address space, 
that address space can be used to communicate data implicitly via load and store 
operations—hence the name shared memory for such multiprocessors. For a mul- 
tiprocessor with multiple address spaces, communication of data is done by 
explicitly passing messages among the processors. Therefore, these multiproces- 
sors are often called message-passing multiprocessors. Clusters inherently use 
message passing. 
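
The contrast between the two communication mechanisms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python threads always share one address space, so the `queue.Queue` below merely stands in for an interconnect carrying explicit messages, and the function names (`shared_address_space_sum`, `message_passing_sum`, `run_workers`) are hypothetical.

```python
import queue
import threading


def run_workers(worker, n_workers):
    """Start one thread per worker index and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_address_space_sum(values, n_workers=4):
    """Implicit communication: each worker stores its result in shared memory."""
    partial = [0] * n_workers  # one shared location per worker

    def worker(i):
        partial[i] = sum(values[i::n_workers])  # an ordinary store

    run_workers(worker, n_workers)
    return sum(partial)  # ordinary loads pick up the workers' results


def message_passing_sum(values, n_workers=4):
    """Explicit communication: each worker sends its result as a message."""
    channel = queue.Queue()  # stands in for the interconnection network

    def worker(i):
        channel.put(sum(values[i::n_workers]))  # explicit send

    run_workers(worker, n_workers)
    return sum(channel.get() for _ in range(n_workers))  # explicit receives


data = list(range(1000))
shared_total = shared_address_space_sum(data)
message_total = message_passing_sum(data)
```

Both functions compute the same total; the difference lies in whether the partial results move through plain loads and stores or through explicit send and receive operations.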


Challenges of Parallel Processing 


The application of multiprocessors ranges from running independent tasks with 
essentially no communication to running parallel programs where threads must 
communicate to complete the task. Two important hurdles, both explainable with 
Amdahl's Law, make parallel processing challenging. The degree to which these 
hurdles are difficult or easy is determined both by the application and by the 
architecture. 

The first hurdle has to do with the limited parallelism available in programs, 
and the second arises from the relatively high cost of communications. Limita- 
tions in available parallelism make it difficult to achieve good speedups in any 
parallel processor, as our first example shows. 


Example 

Suppose you want to achieve a speedup of 80 with 100 processors. What fraction 
of the original computation can be sequential? 

Answer 

Amdahl's Law is 

$$\text{Speedup} = \frac{1}{\dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} + (1 - \text{Fraction}_{\text{enhanced}})}$$

For simplicity in this example, assume that the program operates in only two 
modes: parallel with all processors fully used, which is the enhanced mode, or 
serial with only one processor in use. With this simplification, the speedup in 
enhanced mode is simply the number of processors, while the fraction of 
enhanced mode is the time spent in parallel mode. Substituting into the previous 
equation: 


$$80 = \frac{1}{\dfrac{\text{Fraction}_{\text{parallel}}}{100} + (1 - \text{Fraction}_{\text{parallel}})}$$


Simplifying this equation yields 


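$$0.8 \times \text{Fraction}_{\text{parallel}} + 80 \times (1 - \text{Fraction}_{\text{parallel}}) = 1$$

$$80 - 79.2 \times \text{Fraction}_{\text{parallel}} = 1$$

$$\text{Fraction}_{\text{parallel}} = \frac{80 - 1}{79.2}$$

$$\text{Fraction}_{\text{parallel}} = 0.9975$$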


Thus, to achieve a speedup of 80 with 100 processors, only 0.25% of the original 
computation can be sequential. Of course, to achieve linear speedup (speedup of 
n with n processors), the entire program must usually be parallel with no serial 
portions. In practice, programs do not just operate in fully parallel or sequential 
mode, but often use less than the full complement of the processors when running 
in parallel mode. 
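
The calculation above can be checked numerically. The following is a minimal sketch under the same two-mode assumption used in the example (fully parallel or fully serial); the function names `amdahl_speedup` and `required_parallel_fraction` are illustrative, not taken from the text.

```python
def amdahl_speedup(fraction_parallel, processors):
    """Speedup from Amdahl's Law when the enhanced mode uses all processors."""
    return 1.0 / (fraction_parallel / processors + (1.0 - fraction_parallel))


def required_parallel_fraction(target_speedup, processors):
    """Solve 1/S = F/N + (1 - F) for the parallel fraction F."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


fraction = required_parallel_fraction(80, 100)   # same value as (80 - 1) / 79.2
sequential_percent = 100.0 * (1.0 - fraction)    # roughly 0.25% may be sequential
check = amdahl_speedup(fraction, 100)            # recovers the target speedup of 80
```

Evaluating `amdahl_speedup` at slightly smaller parallel fractions shows how quickly the achievable speedup falls away from the target.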


The second major challenge in parallel processing involves the large latency 
of remote access in a parallel processor. In existing shared-memory multiproces- 
sors, communication of data between processors may cost anywhere from 50 
clock cycles (for multicores) to over 1000 clock cycles (for large-scale multipro- 
cessors), depending on the communication mechanism, the type of interconnec- 
tion network, and the scale of the multiprocessor. The effect of long 
communication delays is clearly substantial. Let's consider a simple example. 


Example 

Suppose we have an application running on a 32-processor multiprocessor, which 
has a 200 ns time to handle a reference to a remote memory. For this application, 
assume that all the references except those involving communication hit in the 
local memory hierarchy, which is slightly optimistic. Processors are stalled on a 
remote request, and the processor clock rate is 2 GHz. If the base CPI (assuming 
that all references hit in the cache) is 0.5, how much faster is the multiprocessor if 
there is no communication versus if 0.2% of the instructions involve a remote 
communication reference? 


Answer 

It is simpler to first calculate the CPI. The effective CPI for the multiprocessor 
with 0.2% remote references is 

$$\text{CPI} = \text{Base CPI} + \text{Remote request rate} \times \text{Remote request cost}$$

$$= 0.5 + 0.2\% \times \text{Remote request cost}$$

The remote request cost is 

$$\frac{\text{Remote access cost}}{\text{Cycle time}} = \frac{200\ \text{ns}}{0.5\ \text{ns}} = 400\ \text{cycles}$$

Hence we can compute the CPI: 

$$\text{CPI} = 0.5 + 0.8 = 1.3$$


The multiprocessor with all local references is 1.3/0.5 = 2.6 times faster. In prac- 
tice, the performance analysis is much more complex, since some fraction of the 
noncommunication references will miss in the local hierarchy and the remote 
access time does not have a single constant value. For example, the cost of a 
remote reference could be quite a bit worse, since contention caused by many ref- 
erences trying to use the global interconnect can lead to increased delays. 
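
The same arithmetic can be written as a short Python sketch, which makes it easy to vary the remote reference rate or latency. It uses the example's simplifying assumptions (a single constant remote latency and no local misses); the function name `effective_cpi` and its parameter names are illustrative.

```python
def effective_cpi(base_cpi, remote_rate, remote_latency_ns, clock_ghz):
    """Base CPI plus stall cycles from remote references (processor stalls fully)."""
    cycle_time_ns = 1.0 / clock_ghz                       # 2 GHz gives 0.5 ns
    remote_cost_cycles = remote_latency_ns / cycle_time_ns
    return base_cpi + remote_rate * remote_cost_cycles


base_cpi = 0.5
cpi = effective_cpi(base_cpi, remote_rate=0.002, remote_latency_ns=200.0, clock_ghz=2.0)
slowdown = cpi / base_cpi   # how much faster the all-local case is
```

With these inputs the remote cost is 400 cycles, the effective CPI is about 1.3, and the ratio is about 2.6, matching the worked answer.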


These problems (insufficient parallelism and long-latency remote communication) 
are the two biggest performance challenges in using multiprocessors. 
The problem of inadequate application parallelism must be attacked primarily in 
software with new algorithms that can have better parallel performance. Reduc- 
ing the impact of long remote latency can be attacked both by the architecture 
and by the programmer. For example, we can reduce the frequency of remote 
accesses with either hardware mechanisms, such as caching shared data, or soft- 
ware mechanisms, such as restructuring the data to make more accesses local. We 
can try to tolerate the latency by using multithreading (discussed in Chapter 3 and 
later in this chapter) or by using prefetching (a topic we cover extensively in 
Chapter 5). 

Much of this chapter focuses on techniques for reducing the impact of long 
remote communication latency. For example, Sections 4.2 and 4.3 discuss how 
caching can be used to reduce remote access frequency, while maintaining a 
coherent view of memory. Section 4.5 discusses synchronization, which, because 
it inherently involves interprocessor communication and also can limit parallel- 
ism, is a major potential bottleneck. Section 4.6 covers latency-hiding techniques 
and memory consistency models for shared memory. In Appendix H, we focus pri- 
marily on large-scale multiprocessors, which are used predominantly for scien- 
tific work. In that appendix, we examine the nature of such applications and the 
challenges of achieving speedup with dozens to hundreds of processors. 

Understanding a modern shared-memory multiprocessor requires a good 
understanding of the basics of caches. Readers who have covered this topic in 
our introductory book, Computer Organization and Design: The Hardware/ 
Software Interface, will be well-prepared. If topics such as write-back caches 
and multilevel caches are unfamiliar to you, you should take the time to review 
Appendix C. 


Symmetric Shared-Memory Architectures 


The use of large, multilevel caches can substantially reduce the memory band- 
width demands of a processor. If the main memory bandwidth demands of a sin- 
gle processor are reduced, multiple processors may be able to share the same 
memory. Starting in the 1980s, this observation, combined with the emerging 
dominance of the microprocessor, motivated many designers to create small- 
scale multiprocessors where several processors shared a single physical memory 
connected by a shared bus. Because of the small size of the processors and the 
significant reduction in the requirements for bus bandwidth achieved by large 
caches, such symmetric multiprocessors were extremely cost-effective, provided 
that a sufficient amount of memory bandwidth existed. Early designs of such 
multiprocessors were able to place the processor and cache subsystem on a 
board, which plugged into the bus backplane. Subsequent versions of such 
designs in the 1990s could achieve higher densities with two to four processors 
per board, and often used multiple buses and interleaved memories to support the 
faster processors. 

IBM introduced the first on-chip multiprocessor for the general-purpose com- 
puting market in 2000. AMD and Intel followed with two-processor versions for 
the server market in 2005, and Sun introduced T1, an eight-processor multicore 
in 2006. Section 4.8 looks at the design and performance of T1. The earlier 
Figure 4.1 on page 200 shows a simple diagram of such a multiprocessor. With 
the more recent, higher-performance processors, the memory demands have out- 
stripped the capability of reasonable buses. As a result, most recent designs use a 
small-scale switch or a limited point-to-point network. 

Symmetric shared-memory machines usually support the caching of both 
shared and private data. Private data are used by a single processor, while shared 
data are used by multiple processors, essentially providing communication 
among the processors through reads and writes of the shared data. When a private 
item is cached, its location is migrated to the cache, reducing the average access 
time as well as the memory bandwidth required. Since no other processor uses 
the data, the program behavior is identical to that in a uniprocessor. When shared 
data are cached, the shared value may be replicated in multiple caches. In addi- 
tion to the reduction in access latency and required memory bandwidth, this rep- 
lication also provides a reduction in contention that may exist for shared data 
items that are being read by multiple processors simultaneously. Caching of 
shared data, however, introduces a new problem: cache coherence. 


What Is Multiprocessor Cache Coherence? 


Unfortunately, caching shared data introduces a new problem because the view of 
memory held by two different processors is through their individual caches, 
which, without any additional precautions, could end up seeing two different val- 
ues. Figure 4.3 illustrates the problem and shows how two different processors 
can have two different values for the same location. This difficulty is generally 
referred to as the cache coherence problem. 

Figure 4.3 The cache coherence problem for a single memory location (X), read and 
written by two processors (A and B). We initially assume that neither cache contains 
the variable and that X has the value 1. We also assume a write-through cache; a write- 
back cache adds some additional but similar complications. After the value of X has 
been written by A, A's cache and the memory both contain the new value, but B's cache 
does not, and if B reads the value of X, it will receive 1! 

Informally, we could say that a memory system is coherent if any read of a 
data item returns the most recently written value of that data item. This definition, 
although intuitively appealing, is vague and simplistic; the reality is much more 
complex. This simple definition contains two different aspects of memory system 
behavior, both of which are critical to writing correct shared-memory programs. 
The first aspect, called coherence, defines what values can be returned by a read. 
The second aspect, called consistency, determines when a written value will be 
returned by a read. Let's look at coherence first. 

A memory system is coherent if 


1. A read by a processor P to a location X that follows a write by P to X, with no 
writes of X by another processor occurring between the write and the read by 
P, always returns the value written by P. 


2. A read by a processor to location X that follows a write by another processor 
to X returns the written value if the read and write are sufficiently separated 
in time and no other writes to X occur between the two accesses. 


3. Writes to the same location are serialized; that is, two writes to the same 
location by any two processors are seen in the same order by all processors. 
For example, if the values 1 and then 2 are written to a location, processors 
can never read the value of the location as 2 and then later read it as 1. 


The first property simply preserves program order—we expect this property 
to be true even in uniprocessors. The second property defines the notion of what 
it means to have a coherent view of memory: If a processor could continuously 
read an old data value, we would clearly say that memory was incoherent. 
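
The stale-read scenario of Figure 4.3 can be reproduced with a toy model of two write-through caches that have no coherence mechanism. This is a sketch under stated assumptions: the class name `WriteThroughCache` is hypothetical, each block holds a single value, and the value 0 written by A is chosen only for illustration (the caption specifies just the initial value 1).

```python
class WriteThroughCache:
    """A private cache with write-through and no invalidation or update."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # locally cached copies

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: memory is never consulted

    def write(self, addr, value):
        self.lines[addr] = value             # update the local copy
        self.memory[addr] = value            # write through to memory


memory = {"X": 1}
cache_a = WriteThroughCache(memory)
cache_b = WriteThroughCache(memory)

cache_a.read("X")               # A caches X = 1
cache_b.read("X")               # B caches X = 1
cache_a.write("X", 0)           # A and memory now hold the new value
stale = cache_b.read("X")       # B still hits on its old copy
```

After the write, memory and A's cache agree, yet B keeps returning its old copy on every later read. That is the behavior the second coherence property rules out.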

The need for write serialization is more subtle, but equally important. Sup- 
pose we did not serialize writes, and processor P1 writes location X followed by 
P2 writing location X. Serializing the writes ensures that every processor will see 
the write done by P2 at some point. If we did not serialize the writes, it might be 
the case that some processor could see the write of P2 first and then see the write 
of P1, maintaining the value written by P1 indefinitely. The simplest way to avoid 
such difficulties is to ensure that all writes to the same location are seen in the 
same order; this property is called write serialization. 

Although the three properties just described are sufficient to ensure coher- 
ence, the question of when a written value will be seen is also important. To see 
why, observe that we cannot require that a read of X instantaneously see the value 
written for X by some other processor. If, for example, a write of X on one pro- 
cessor precedes a read of X on another processor by a very small time, it may be 
impossible to ensure that the read returns the value of the data written, since the 
written data may not even have left the processor at that point. The issue of 
exactly when a written value must be seen by a reader is defined by a memory 
consistency model—a topic discussed in Section 4.6. 

Coherence and consistency are complementary: Coherence defines the 
behavior of reads and writes to the same memory location, while consistency 
defines the behavior of reads and writes with respect to accesses to other memory 
locations. For now, make the following two assumptions. First, a write does not 
complete (and allow the next write to occur) until all processors have seen the 
effect of that write. Second, the processor does not change the order of any write 
with respect to any other memory access. These two conditions mean that if a 
processor writes location A followed by location B, any processor that sees the 
new value of B must also see the new value of A. These restrictions allow the pro- 
cessor to reorder reads, but force the processor to finish a write in program order. 
We will rely on this assumption until we reach Section 4.6, where we will see 
exactly the implications of this definition, as well as the alternatives. 


Basic Schemes for Enforcing Coherence 


The coherence problem for multiprocessors and I/O, although similar in origin, 
has different characteristics that affect the appropriate solution. Unlike I/O, 
where multiple data copies are a rare event—one to be avoided whenever possi- 
ble—a program running on multiple processors will normally have copies of the 
same data in several caches. In a coherent multiprocessor, the caches provide 
both migration and replication of shared data items. 

Coherent caches provide migration, since a data item can be moved to a local 
cache and used there in a transparent fashion. This migration reduces both the 
latency to access a shared data item that is allocated remotely and the bandwidth 
demand on the shared memory. 

Coherent caches also provide replication for shared data that are being 
simultaneously read, since the caches make a copy of the data item in the local 
cache. Replication reduces both latency of access and contention for a read 
shared data item. Supporting this migration and replication is critical to perfor- 
mance in accessing shared data. Thus, rather than trying to solve the problem by 
avoiding it in software, small-scale multiprocessors adopt a hardware solution by 
introducing a protocol to maintain coherent caches. 

The protocols to maintain coherence for multiple processors are called cache 
coherence protocols. Key to implementing a cache coherence protocol is tracking 
the state of any sharing of a data block. There are two classes of protocols, which 
use different techniques to track the sharing status, in use: 


• Directory based: The sharing status of a block of physical memory is kept in 
just one location, called the directory; we focus on this approach in Section 
4.4, when we discuss scalable shared-memory architecture. Directory-based 
coherence has slightly higher implementation overhead than snooping, but it 
can scale to larger processor counts. The Sun T1 design, the topic of Section 
4.8, uses directories, albeit with a central physical memory. 


• Snooping: Every cache that has a copy of the data from a block of physical 
memory also has a copy of the sharing status of the block, but no centralized 
state is kept. The caches are all accessible via some broadcast medium (a bus 
or switch), and all cache controllers monitor or snoop on the medium to 
determine whether or not they have a copy of a block that is requested on a 
bus or switch access. We focus on this approach in this section. 


Snooping protocols became popular with multiprocessors using microproces- 
sors and caches attached to a single shared memory because these protocols can 
use a preexisting physical connection—the bus to memory—to interrogate the 
status of the caches. In the following section we explain snoop-based cache 
coherence as implemented with a shared bus, but any communication medium 
that broadcasts cache misses to all processors can be used to implement a snoop- 
ing-based coherence scheme. This broadcasting to all caches is what makes 
snooping protocols simple to implement but also limits their scalability. 


Snooping Protocols 


There are two ways to maintain the coherence requirement described in the prior 
subsection. One method is to ensure that a processor has exclusive access to a 
data item before it writes that item. This style of protocol is called a write invali- 
date protocol because it invalidates other copies on a write. It is by far the most 
common protocol, both for snooping and for directory schemes. Exclusive access 
ensures that no other readable or writable copies of an item exist when the write 
occurs: All other cached copies of the item are invalidated. 

Figure 4.4 shows an example of an invalidation protocol for a snooping bus 
with write-back caches in action. To see how this protocol ensures coherence, con- 
sider a write followed by a read by another processor: Since the write requires 
exclusive access, any copy held by the reading processor must be invalidated 
(hence the protocol name). Thus, when the read occurs, it misses in the cache and 
is forced to fetch a new copy of the data. For a write, we require that the writing 
processor have exclusive access, preventing any other processor from being able 
to write simultaneously. If two processors do attempt to write the same data simul- 
taneously, one of them wins the race (we'll see how we decide who wins shortly), 
causing the other processor's copy to be invalidated. For the other processor to 
complete its write, it must obtain a new copy of the data, which must now contain 
the updated value. Therefore, this protocol enforces write serialization. 

Processor activity      Bus activity         Contents of      Contents of      Contents of memory
                                             CPU A's cache    CPU B's cache    location X
                                                                               0
CPU A reads X           Cache miss for X     0                                 0
CPU B reads X           Cache miss for X     0                0                0
CPU A writes a 1 to X   Invalidation for X   1                                 0
CPU B reads X           Cache miss for X     1                1                1

Figure 4.4 An example of an invalidation protocol working on a snooping bus for a 
single cache block (X) with write-back caches. We assume that neither cache initially 
holds X and that the value of X in memory is 0. The CPU and memory contents show the 
value after the processor and bus activity have both completed. A blank indicates no 
activity or no copy cached. When the second miss by B occurs, CPU A responds with the 
value, canceling the response from memory. In addition, both the contents of B's cache 
and the memory contents of X are updated. This update of memory, which occurs when 
a block becomes shared, simplifies the protocol, but it is possible to track the owner- 
ship and force the write back only if the block is replaced. This requires the introduction 
of an additional state called "owner," which indicates that a block may be shared, but 
the owning processor is responsible for updating any other processors and memory 
when it changes the block or replaces it. 

The alternative to an invalidate protocol is to update all the cached copies of a 
data item when that item is written. This type of protocol is called a write update 
or write broadcast protocol. Because a write update protocol must broadcast all 
writes to shared cache lines, it consumes considerably more bandwidth. For this 
reason, all recent multiprocessors have opted to implement a write invalidate pro- 
tocol, and we will focus only on invalidate protocols for the rest of the chapter. 
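
The sequence of events in Figure 4.4 can be replayed with a small Python model of a write invalidate protocol on a snooping bus. It is a sketch under simplifying assumptions, not a description of any real controller: the class names `SnoopingBus` and `WriteBackCache` are hypothetical, each block holds a single value, every write broadcasts an invalidation (no shared-state bit is tracked), and memory is updated whenever a dirty block is supplied to another cache.

```python
class SnoopingBus:
    """Broadcast medium: every request is seen by all attached caches."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def read_miss(self, requester, addr):
        for cache in self.caches:
            if cache is not requester and cache.dirty.get(addr, False):
                # The cache holding a dirty copy supplies the block, and
                # memory is updated at the same time (block becomes shared).
                self.memory[addr] = cache.lines[addr]
                cache.dirty[addr] = False
        return self.memory[addr]

    def invalidate(self, requester, addr):
        for cache in self.caches:
            if cache is not requester:       # all other copies are discarded
                cache.lines.pop(addr, None)
                cache.dirty.pop(addr, None)


class WriteBackCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        self.dirty = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss goes to the bus
            self.lines[addr] = self.bus.read_miss(self, addr)
            self.dirty[addr] = False
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(self, addr)      # obtain exclusive access first
        self.lines[addr] = value
        self.dirty[addr] = True              # write-back: memory not yet updated


memory = {"X": 0}
bus = SnoopingBus(memory)
cpu_a, cpu_b = WriteBackCache(bus), WriteBackCache(bus)

def snapshot():
    """(A's copy, B's copy, memory), with None meaning no copy cached."""
    return (cpu_a.lines.get("X"), cpu_b.lines.get("X"), memory["X"])

trace = []
cpu_a.read("X");     trace.append(snapshot())   # cache miss for X
cpu_b.read("X");     trace.append(snapshot())   # cache miss for X
cpu_a.write("X", 1); trace.append(snapshot())   # invalidation for X
cpu_b.read("X");     trace.append(snapshot())   # cache miss, A supplies the value
```

Each entry of `trace` corresponds to one row of the figure. Because every write passes through `invalidate` before it completes, two writers to the same block are ordered by whichever reaches the bus first, which is how the protocol obtains write serialization.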


Basic Implementation Techniques 


The key to implementing an invalidate protocol in a small-scale multiprocessor is 
the use of the bus, or another broadcast medium, to perform invalidates. To per- 
form an invalidate, the processor simply acquires bus access and broadcasts the 
address to be invalidated on the bus. All processors continuously snoop on the 
bus, watching the addresses. The processors check whether the address on the bus 
is in their cache. If so, the corresponding data in the cache are invalidated. 

When a write to a block that is shared occurs, the writing processor must 
acquire bus access to broadcast its invalidation. If two processors attempt to write 
shared blocks at the same time, their attempts to broadcast an invalidate operation 
will be serialized when they arbitrate for the bus. The first processor to obtain bus 
access will cause any other copies of the block it is writing to be invalidated. If 
the processors were attempting to write the same block, the serialization enforced 
by the bus also serializes their writes. One implication of this scheme is that a 
write to a shared data item cannot actually complete until it obtains bus access. 
All coherence schemes require some method of serializing accesses to the same 
cache block, either by serializing access to the communication medium or 
another shared structure. 

In addition to invalidating outstanding copies of a cache block that is being 
written into, we also need to locate a data item when a cache miss occurs. In a 
write-through cache, it is easy to find the recent value of a data item, since all 
written data are always sent to the memory, from which the most recent value of 
a data item can always be fetched. (Write buffers can lead to some additional 
complexities, which are discussed in the next chapter.) In a design with adequate 
memory bandwidth to support the write traffic from the processors, using write 
through simplifies the implementation of cache coherence. 

For a write-back cache, the problem of finding the most recent data value is 
harder, since the most recent value of a data item can be in a cache rather than in 
memory. Happily, write-back caches can use the same snooping scheme both for 
cache misses and for writes: Each processor snoops every address placed on the 
bus. If a processor finds that it has a dirty copy of the requested cache block, it 
provides that cache block in response to the read request and causes the memory 
access to be aborted. The additional complexity comes from having to retrieve 
the cache block from a processor's cache, which can often take longer than 
retrieving it from the shared memory if the processors are in separate chips. Since 
write-back caches generate lower requirements for memory bandwidth, they can 
support larger numbers of faster processors and have been the approach chosen in 
most multiprocessors, despite the additional complexity of maintaining coher- 
ence. Therefore, we will examine the implementation of coherence with write- 
back caches. 
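
The read-miss path for a write-back cache can be sketched in a few lines of Python. This is a simplified model rather than any particular machine's design: the function name `service_read_miss`, the dictionary layout, and the choice to update memory when the dirty owner supplies the block are all assumptions made for illustration.

```python
def service_read_miss(addr, requester, caches, memory):
    """Return (data, source) for a read miss placed on the bus.

    caches: dict mapping cache name -> dict of addr -> (data, dirty).
    memory: dict mapping addr -> data.
    """
    for name, blocks in caches.items():
        if name == requester:
            continue
        entry = blocks.get(addr)
        if entry is not None and entry[1]:      # snoop finds a dirty copy
            data = entry[0]
            memory[addr] = data                 # assumption: memory is updated too
            blocks[addr] = (data, False)        # owner keeps a clean copy
            return data, name                   # the memory access is aborted
    return memory[addr], "memory"               # no dirty copy: memory responds


caches = {"P1": {0x10: (99, True)}, "P2": {}}
memory = {0x10: 1}                              # stale value in memory
first = service_read_miss(0x10, "P2", caches, memory)
second = service_read_miss(0x10, "P2", caches, memory)
```

The first request is served by P1's cache because memory is stale; the second is served by memory, since the block is now clean everywhere.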

The normal cache tags can be used to implement the process of snooping, and 
the valid bit for each block makes invalidation easy to implement. Read misses, 
whether generated by an invalidation or by some other event, are also straightfor- 
ward since they simply rely on the snooping capability. For writes we'd like to 
know whether any other copies of the block are cached because, if there are no 
other cached copies, then the write need not be placed on the bus in a write-back 
cache. Not sending the write reduces both the time taken by the write and the 
required bandwidth. 

To track whether or not a cache block is shared, we can add an extra state bit 
associated with each cache block, just as we have a valid bit and a dirty bit. By 
adding a bit indicating whether the block is shared, we can decide whether a 
write must generate an invalidate. When a write to a block in the shared state 
occurs, the cache generates an invalidation on the bus and marks the block as 
exclusive. No further invalidations will be sent by that processor for that block. 
The processor with the sole copy of a cache block is normally called the owner of 
the cache block. 

When an invalidation is sent, the state of the owner's cache block is changed 
from shared to unshared (or exclusive). If another processor later requests this 
cache block, the state must be made shared again. Since our snooping cache also 
sees any misses, it knows when the exclusive cache block has been requested by 
another processor and the state should be made shared. 

Every bus transaction must check the cache-address tags, which could poten- 
tially interfere with processor cache accesses. One way to reduce this interference 
is to duplicate the tags. The interference can also be reduced in a multilevel cache 
by directing the snoop requests to the L2 cache, which the processor uses only 
when it has a miss in the L1 cache. For this scheme to work, every entry in the L1 
cache must be present in the L2 cache, a property called the inclusion property. If 
the snoop gets a hit in the L2 cache, then it must arbitrate for the L1 cache to update 
the state and possibly retrieve the data, which usually requires a stall of the proces- 
sor. Sometimes it may even be useful to duplicate the tags of the secondary cache to 
further decrease contention between the processor and the snooping activity. We 
discuss the inclusion property in more detail in the next chapter. 
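
The filtering effect of inclusion can be illustrated with a short Python sketch. The function `snoop_levels` and the use of plain sets for the tag arrays are hypothetical simplifications; the point is only that a snoop that misses in L2 never needs to disturb L1.

```python
def snoop_levels(addr, l1_tags, l2_tags):
    """Return the cache levels a bus snoop must consult for addr.

    Relies on inclusion: every block in L1 is also in L2, so a miss in
    L2 proves the block is absent from L1.
    """
    if not l1_tags <= l2_tags:
        raise ValueError("inclusion property violated")
    if addr not in l2_tags:
        return ["L2"]            # L2 tag check only; the processor's L1 is undisturbed
    return ["L2", "L1"]          # L2 hit: must arbitrate for L1 as well


l1 = {0x100, 0x140}
l2 = {0x100, 0x140, 0x180}
filtered = snoop_levels(0x200, l1, l2)
forwarded = snoop_levels(0x180, l1, l2)
```

Without inclusion, the L2 lookup could not rule out a copy in L1, and every snoop would have to contend with the processor for the L1 tags.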


An Example Protocol 


A snooping coherence protocol is usually implemented by incorporating a finite- 
state controller in each node. This controller responds to requests from the pro- 
cessor and from the bus (or other broadcast medium), changing the state of the 
selected cache block, as well as using the bus to access data or to invalidate it. 
Logically, you can think of a separate controller being associated with each 
block; that is, snooping operations or cache requests for different blocks can pro- 
ceed independently. In actual implementations, a single controller allows multi- 
ple operations to distinct blocks to proceed in interleaved fashion (that is, one 
operation may be initiated before another is completed, even though only one 
cache access or one bus access is allowed at a time). Also, remember that, 
although we refer to a bus in the following description, any interconnection net- 
work that supports a broadcast to all the coherence controllers and their associ- 
ated caches can be used to implement snooping. 

The simple protocol we consider has three states: invalid, shared, and modi- 
fied. The shared state indicates that the block is potentially shared, while the 
modified state indicates that the block has been updated in the cache; note that 
the modified state implies that the block is exclusive. Figure 4.5 shows the 
requests generated by the processor-cache module in a node (in the top half of the 
table) as well as those coming from the bus (in the bottom half of the table). This 
protocol is for a write-back cache but is easily changed to work for a write- 
through cache by reinterpreting the modified state as an exclusive state and 
updating the cache on writes in the normal fashion for a write-through cache. The 
most common extension of this basic protocol is the addition of an exclusive 
state, which describes a block that is unmodified but held in only one cache; the 
caption of Figure 4.5 describes this state and its addition in more detail. 

When an invalidate or a write miss is placed on the bus, any processors with 
copies of the cache block invalidate it. For a write-through cache, the data for a 
write miss can always be retrieved from the memory. For a write miss in a write- 
back cache, if the block is exclusive in just one cache, that cache also writes back 
the block; otherwise, the data can be read from memory. 

Figure 4.6 shows a finite-state transition diagram for a single cache block 
using a write invalidation protocol and a write-back cache. For simplicity, the 
three states of the protocol are duplicated to represent transitions based on pro- 
cessor requests (on the left, which corresponds to the top half of the table in Fig- 
ure 4.5), as opposed to transitions based on bus requests (on the right, which 
corresponds to the bottom half of the table in Figure 4.5). Boldface type is used 
to distinguish the bus actions, as opposed to the conditions on which a state tran- 
sition depends. The state in each node represents the state of the selected cache 
block specified by the processor or bus request. 

All of the states in this cache protocol would be needed in a uniprocessor 
cache, where they would correspond to the invalid, valid (and clean), and dirty 
states. Most of the state changes indicated by arcs in the left half of Figure 4.6 
would be needed in a write-back uniprocessor cache, with the exception being 
the invalidate on a write hit to a shared block. The state changes represented by 
the arcs in the right half of Figure 4.6 are needed only for coherence and would 
not appear at all in a uniprocessor cache controller. 

As mentioned earlier, there is only one finite-state machine per cache, with 
stimuli coming either from the attached processor or from the bus. Figure 4.7 
shows how the state transitions in the right half of Figure 4.6 are combined 
with those in the left half of the figure to form a single state diagram for each 
cache block. 

To understand why this protocol works, observe that any valid cache block 
is either in the shared state in one or more caches or in the exclusive state in 
exactly one cache. Any transition to the exclusive state (which is required for a 
processor to write to the block) requires an invalidate or write miss to be placed 
on the bus, causing all caches to make the block invalid. In addition, if some 
other cache had the block in exclusive state, that cache generates a write back, 
which supplies the block containing the desired address. Finally, if a read miss 
occurs on the bus to a block in the exclusive state, the cache with the exclusive 
copy changes its state to shared. 
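
The invariant described above can be checked mechanically. The following Python sketch is a minimal model, not the book's implementation: it tracks a single block across several caches, ignores replacements and data values, and uses hypothetical names (`MsiBlock`, `read`, `write`, `coherent`). The state the text calls exclusive corresponds to `MODIFIED` here.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"


class MsiBlock:
    """State of one block in n snooping caches (no replacements modeled)."""

    def __init__(self, n):
        self.state = [INVALID] * n

    def _snoop(self, src, bus_op):
        # Every other cache holding a valid copy reacts to the bus request.
        for i, s in enumerate(self.state):
            if i == src or s == INVALID:
                continue
            if bus_op == "read_miss":
                # A modified copy is written back and becomes shared;
                # a shared copy stays shared.
                self.state[i] = SHARED
            else:  # "write_miss" or "invalidate"
                self.state[i] = INVALID

    def read(self, p):
        if self.state[p] == INVALID:
            self._snoop(p, "read_miss")
            self.state[p] = SHARED
        # A read hit in S or M causes no bus traffic and no state change.

    def write(self, p):
        if self.state[p] == INVALID:
            self._snoop(p, "write_miss")
        elif self.state[p] == SHARED:
            self._snoop(p, "invalidate")
        self.state[p] = MODIFIED

    def coherent(self):
        # At most one modified copy, and a modified copy excludes shared ones.
        m = self.state.count(MODIFIED)
        s = self.state.count(SHARED)
        return m == 0 or (m == 1 and s == 0)


block = MsiBlock(3)
history = []
for op, p in [("read", 0), ("read", 1), ("write", 1), ("read", 2), ("write", 0)]:
    getattr(block, op)(p)
    history.append((list(block.state), block.coherent()))
```

Each entry in `history` pairs the per-cache states after an access with the result of the invariant check.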

The actions in gray in Figure 4.7, which handle read and write misses on the 
bus, are essentially the snooping component of the protocol. One other property 
that is preserved in this protocol, and in most other protocols, is that any memory 
block in the shared state is always up to date in the memory, which simplifies the 
implementation. 

Although our simple cache protocol is correct, it omits a number of complica- 
tions that make the implementation much trickier. The most important of these is 

Request     Source     State of addressed  Type of       Function and explanation
                       cache block         cache action

Read hit    processor  shared or modified  normal hit    Read data in cache.
Read miss   processor  invalid             normal miss   Place read miss on bus.
Read miss   processor  shared              replacement   Address conflict miss: place read miss on bus.
Read miss   processor  modified            replacement   Address conflict miss: write back block, then
                                                         place read miss on bus.
Write hit   processor  modified            normal hit    Write data in cache.
Write hit   processor  shared              coherence     Place invalidate on bus. These operations are
                                                         often called upgrade or ownership misses, since
                                                         they do not fetch the data but only change the
                                                         state.
Write miss  processor  invalid             normal miss   Place write miss on bus.
Write miss  processor  shared              replacement   Address conflict miss: place write miss on bus.
Write miss  processor  modified            replacement   Address conflict miss: write back block, then
                                                         place write miss on bus.
Read miss   bus        shared              no action     Allow memory to service read miss.
Read miss   bus        modified            coherence     Attempt to share data: place cache block on bus
                                                         and change state to shared.
Invalidate  bus        shared              coherence     Attempt to write shared block; invalidate the
                                                         block.
Write miss  bus        shared              coherence     Attempt to write block that is shared;
                                                         invalidate the cache block.
Write miss  bus        modified            coherence     Attempt to write block that is exclusive
                                                         elsewhere: write back the cache block and make
                                                         its state invalid.

Figure 4.5 The cache coherence mechanism receives requests from both the processor and the bus and 
responds to these based on the type of request, whether it hits or misses in the cache, and the state of the cache 
block specified in the request. The fourth column describes the type of cache action as normal hit or miss (the same 
as a uniprocessor cache would see), replacement (a uniprocessor cache replacement miss), or coherence (required to 
maintain cache coherence); a normal or replacement action may cause a coherence action depending on the state of 
the block in other caches. For read misses, write misses, or invalidates snooped from the bus, an action is required 
only if the read or write addresses match a block in the cache and the block is valid. Some protocols also introduce a 
state to designate when a block is exclusively in one cache but has not yet been written. This state can arise if a write 
access is broken into two pieces: getting the block exclusively in one cache and then subsequently updating it; in 
such a protocol this "exclusive unmodified state" is transient, ending as soon as the write is completed. Other 
protocols use and maintain an exclusive state for an unmodified block. In a snooping protocol, this state can be entered 
when a processor reads a block that is not resident in any other cache. Because all subsequent accesses are snooped, 
it is possible to maintain the accuracy of this state. In particular, if another processor issues a read miss, the state is 
changed from exclusive to shared. The advantage of adding this state is that a subsequent write to a block in the 
exclusive state by the same processor need not acquire bus access or generate an invalidate, since the block is 
known to be exclusively in this cache; the processor merely changes the state to modified. This state is easily added 
by using the bit that encodes the coherent state as an exclusive state and using the dirty bit to indicate that a block is 
modified. The popular MESI protocol, which is named for the four states it includes (modified, exclusive, shared, and 
invalid), uses this structure. The MOESI protocol introduces another extension: the "owned" state, as described in the 
caption of Figure 4.4. 

Figure 4.6 A write invalidate, cache coherence protocol for a write-back cache showing the states and state 
transitions for each block in the cache. The cache states are shown in circles, with any access permitted by the processor 
without a state transition shown in parentheses under the name of the state. The stimulus causing a state change is 
shown on the transition arcs in regular type, and any bus actions generated as part of the state transition are shown 
on the transition arc in bold. The stimulus actions apply to a block in the cache, not to a specific address in the cache. 
Hence, a read miss to a block in the shared state is a miss for that cache block but for a different address. The left side 
of the diagram shows state transitions based on actions of the processor associated with this cache; the right side 
shows transitions based on operations on the bus. A read miss in the exclusive or shared state and a write miss in the 
exclusive state occur when the address requested by the processor does not match the address in the cache block. 
Such a miss is a standard cache replacement miss. An attempt to write a block in the shared state generates an 
invalidate. Whenever a bus transaction occurs, all caches that contain the cache block specified in the bus transaction 
take the action dictated by the right half of the diagram. The protocol assumes that memory provides data on a read 
miss for a block that is clean in all caches. In actual implementations, these two sets of state diagrams are combined. 
In practice, there are many subtle variations on invalidate protocols, including the introduction of the exclusive 
unmodified state and the choice of whether a processor or memory provides data on a miss. 


that the protocol assumes that operations are atomic—that is, an operation can be 
done in such a way that no intervening operation can occur. For example, the pro- 
tocol described assumes that write misses can be detected, acquire the bus, and 
receive a response as a single atomic action. In reality this is not true. Similarly, if 
we used a switch, as all recent multiprocessors do, then even read misses would 
also not be atomic. 

Nonatomic actions introduce the possibility that the protocol can deadlock, 
meaning that it reaches a state where it cannot continue. We will explore how 
these protocols are implemented without a bus shortly. 

Figure 4.7 Cache coherence state diagram with the state transitions induced by the 
local processor shown in black and by the bus activities shown in gray. As in 
Figure 4.6, the activities on a transition are shown in bold. 


Constructing small-scale (two to four processors) multiprocessors has 
become very easy. For example, the Intel Pentium 4 Xeon and AMD Opteron 
processors are designed for use in cache-coherent multiprocessors and have an 
external interface that supports snooping and allows two to four processors to be 
directly connected. They also have larger on-chip caches to reduce bus utiliza- 
tion. In the case of the Opteron processors, the support for interconnecting multi- 
ple processors is integrated onto the processor chip, as are the memory interfaces. 
In the case of the Intel design, a two-processor system can be built with only a 
few additional external chips to interface with the memory system and I/O. 
Although these designs cannot be easily scaled to larger processor counts, they 
offer an extremely cost-effective solution for two to four processors. 

The next section examines the performance of these protocols for our parallel 
and multiprogrammed workloads; the value of these extensions to a basic proto- 
col will be clear when we examine the performance. But before we do that, let's 
take a brief look at the limitations on the use of a symmetric memory structure 
and a snooping coherence scheme. 


Limitations in Symmetric Shared-Memory Multiprocessors and 
Snooping Protocols 


As the number of processors in a multiprocessor grows, or as the memory 
demands of each processor grow, any centralized resource in the system can 
become a bottleneck. In the simple case of a bus-based multiprocessor, the bus 
must support both the coherence traffic as well as normal memory traffic arising 
from the caches. Likewise, if there is a single memory unit, it must accommodate 
all processor requests. As processors have increased in speed in the last few 
years, the number of processors that can be supported on a single bus or by using 
a single physical memory unit has fallen. 

How can a designer increase the memory bandwidth to support either more or 
faster processors? To increase the communication bandwidth between processors 
and memory, designers have used multiple buses as well as interconnection net- 
works, such as crossbars or small point-to-point networks. In such designs, the 
memory system can be configured into multiple physical banks, so as to boost the 
effective memory bandwidth while retaining uniform access time to memory. 
Figure 4.8 shows this approach, which represents a midpoint between the two 
approaches we discussed in the beginning of the chapter: centralized shared 
memory and distributed shared memory. 

The AMD Opteron represents another intermediate point in the spectrum 
between a snoopy and a directory protocol. Memory is directly connected to each 
dual-core processor chip, and up to four processor chips, eight cores in total, can 
be connected. The Opteron implements its coherence protocol using the 
point-to-point links to broadcast to up to three other chips. Because the interprocessor 
links are not shared, the only way a processor can know when an invalidate operation 
has completed is by an explicit acknowledgment. Thus, the coherence protocol uses a 
broadcast to find potentially shared copies, like a snoopy protocol, but uses the 
acknowledgments to order operations, like a directory protocol. Interestingly, the 
remote memory latency and local memory latency are not dramatically different, 
allowing the operating system to treat an Opteron multiprocessor as having 
uniform memory access. 

Figure 4.8 A multiprocessor with uniform memory access using an interconnection 
network rather than a bus. 

A snoopy cache coherence protocol can be used without a centralized bus, but 
still requires that a broadcast be done to snoop the individual caches on every 
miss to a potentially shared cache block. This cache coherence traffic creates 
another limit on the scale and the speed of the processors. Because coherence 
traffic is unaffected by larger caches, faster processors will inevitably overwhelm 
the network and the ability of each cache to respond to snoop requests from all 
the other caches. In Section 4.4, we examine directory-based protocols, which 
eliminate the need for broadcast to all caches on a miss. As processor speeds and 
the number of cores per processor increase, more designers are likely to opt for 
such protocols to avoid the broadcast limit of a snoopy protocol. 


Implementing Snoopy Cache Coherence 


The devil is in the details. 
Classic proverb 


When we wrote the first edition of this book in 1990, our final "Putting It All 
Together" was a 30-processor, single bus multiprocessor using snoop-based 
coherence; the bus had a capacity of just over 50 MB/sec, which would not be 
enough bus bandwidth to support even one Pentium 4 in 2006! When we wrote 
the second edition of this book in 1995, the first cache coherence multiprocessors 
with more than a single bus had recently appeared, and we added an appendix 
describing the implementation of snooping in a system with multiple buses. In 
2006, every multiprocessor system with more than two processors uses an inter- 
connect other than a single bus, and designers must face the challenge of imple- 
menting snooping without the simplification of a bus to serialize events. 

As we said earlier, the major complication in actually implementing the 
snooping coherence protocol we have described is that write and upgrade misses 
are not atomic in any recent multiprocessor. The steps of detecting a write or up- 
grade miss, communicating with the other processors and memory, getting the 
most recent value for a write miss and ensuring that any invalidates are pro- 
cessed, and updating the cache cannot be done as if they took a single cycle. 

In a simple single-bus system, these steps can be made effectively atomic by 
arbitrating for the bus first (before changing the cache state) and not releasing the 
bus until all actions are complete. How can the processor know when all the in- 
validates are complete? In most bus-based multiprocessors a single line is used to 
signal when all necessary invalidates have been received and are being processed. 
Following that signal, the processor that generated the miss can release the bus, 
knowing that any required actions will be completed before any activity related to 
the next miss. By holding the bus exclusively during these steps, the processor ef- 
fectively makes the individual steps atomic. 
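
As a loose software analogy, holding the bus for the whole miss behaves like holding a lock across several steps. The Python sketch below is only an analogy under stated assumptions: a `threading.Lock` stands in for bus arbitration, and the three step names are invented labels for the phases described above.

```python
import threading

bus = threading.Lock()      # stands in for bus arbitration
log = []                    # global order of steps as observed on the "bus"


def write_miss(proc):
    # Arbitrate for the bus first, and hold it until every step is done.
    with bus:
        for step in ("broadcast", "invalidates_acknowledged", "update_cache"):
            log.append((proc, step))


threads = [threading.Thread(target=write_miss, args=(p,)) for p in ("P1", "P2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever processor wins arbitration, its three steps appear contiguously in `log`; the other processor's steps never interleave with them.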

In a system without a bus, we must find some other method of making the 
steps in a miss atomic. In particular, we must ensure that two processors that 
attempt to write the same block at the same time, a situation which is called a race, 
are strictly ordered: one write is processed and completes before the next is begun. 
It does not matter which of two writes in a race wins the race, just that there be 
only a single winner whose coherence actions are completed first. In a snoopy 
system, a single winner is ensured by using broadcast for all misses together with 
some basic properties of the interconnection network. These properties, together 
with the ability to restart the miss handling of the loser in a race, are the keys to 
implementing snoopy cache coherence without a bus. We explain the details in 
Appendix H. 


Performance of Symmetric Shared-Memory 
Multiprocessors 


In a multiprocessor using a snoopy coherence protocol, several different phenom- 
ena combine to determine performance. In particular, the overall cache perfor- 
mance is a combination of the behavior of uniprocessor cache miss traffic and the 
traffic caused by communication, which results in invalidations and subsequent 
cache misses. Changing the processor count, cache size, and block size can affect 
these two components of the miss rate in different ways, leading to overall sys- 
tem behavior that is a combination of the two effects. 

Appendix C breaks the uniprocessor miss rate into the three C's classification 
(capacity, compulsory, and conflict) and provides insight into both application 
behavior and potential improvements to the cache design. Similarly, the misses 
that arise from interprocessor communication, which are often called coherence 
misses, can be broken into two separate sources. 

The first source is the so-called true sharing misses that arise from the com- 
munication of data through the cache coherence mechanism. In an invalidation- 
based protocol, the first write by a processor to a shared cache block causes an 
invalidation to establish ownership of that block. Additionally, when another pro- 
cessor attempts to read a modified word in that cache block, a miss occurs and the 
resultant block is transferred. Both these misses are classified as true sharing 
misses since they directly arise from the sharing of data among processors. 

The second effect, called false sharing, arises from the use of an invalidation- 
based coherence algorithm with a single valid bit per cache block. False sharing 
occurs when a block is invalidated (and a subsequent reference causes a miss) 
because some word in the block, other than the one being read, is written into. If 
the word written into is actually used by the processor that received the invali- 
date, then the reference was a true sharing reference and would have caused a 
miss independent of the block size. If, however, the word being written and the 
word read are different and the invalidation does not cause a new value to be 
communicated, but only causes an extra cache miss, then it is a false sharing 
miss. In a false sharing miss, the block is shared, but no word in the cache is actu- 
ally shared, and the miss would not occur if the block size were a single word. 
The following example makes the sharing patterns clear. 


Assume that words x1 and x2 are in the same cache block, which is in the shared 
state in the caches of both P1 and P2. Assuming the following sequence of 
events, identify each miss as a true sharing miss, a false sharing miss, or a hit. 
Any miss that would occur if the block size were one word is designated a true 
sharing miss. 


Time    P1          P2
1       Write x1
2                   Read x2
3       Write x1
4                   Write x2
5       Read x2


Here are classifications by time step: 


1. This event is a true sharing miss, since x1 was read by P2 and needs to be 
invalidated from P2. 


2. This event is a false sharing miss, since x2 was invalidated by the write of x1 
in P1, but that value of x1 is not used in P2. 


3. This event is a false sharing miss, since the block containing x1 is marked 
shared due to the read in P2, but P2 did not read x1. The cache block 
containing x1 will be in the shared state after the read by P2; a write miss is required 
to obtain exclusive access to the block. In some protocols this will be handled 
as an upgrade request, which generates a bus invalidate, but does not transfer 
the cache block. 


4. This event is a false sharing miss for the same reason as step 3. 


5. This event is a true sharing miss, since the value being read was written by P2. 
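
Replaying the five events through a three-state invalidate protocol shows why every step generates a bus transaction at block granularity. The Python sketch below is illustrative only: `trace` is a hypothetical helper, it models one block with no replacements, and it reports the coherence action for each event without attempting the true or false sharing classification itself.

```python
def trace(events, procs=("P1", "P2")):
    """Replay (processor, op, word) events on one block that starts shared."""
    state = {p: "S" for p in procs}
    out = []
    for proc, op, word in events:
        others = [p for p in procs if p != proc]
        if op == "read":
            if state[proc] == "I":
                action = "read miss"
                for p in others:
                    if state[p] == "M":
                        state[p] = "S"      # owner writes back, keeps a shared copy
                state[proc] = "S"
            else:
                action = "hit"
        else:  # write
            if state[proc] == "M":
                action = "hit"
            else:
                action = "invalidate (upgrade)" if state[proc] == "S" else "write miss"
                for p in others:
                    state[p] = "I"          # all other copies are invalidated
                state[proc] = "M"
        out.append((proc, op, word, action, dict(state)))
    return out


events = [("P1", "write", "x1"), ("P2", "read", "x2"), ("P1", "write", "x1"),
          ("P2", "write", "x2"), ("P1", "read", "x2")]
steps = trace(events)
actions = [s[3] for s in steps]
```

None of the five events is a hit, because the protocol tracks state per block and the two words share one block.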


Although we will see the effects of true and false sharing misses in commer- 
cial workloads, the role of coherence misses is more significant for tightly cou- 
pled applications that share significant amounts of user data. We examine their 
effects in detail in Appendix H, when we consider the performance of a parallel 
scientific workload. 


A Commercial Workload 


In this section, we examine the memory system behavior of a four-processor 
shared-memory multiprocessor. The results were collected either on an Alpha- 
Server 4100 or using a configurable simulator modeled after the AlphaServer 
4100. Each processor in the AlphaServer 4100 is an Alpha 21164, which issues 
up to four instructions per clock and runs at 300 MHz. Although the clock rate of 
the Alpha processor in this system is considerably slower than processors in 
recent systems, the basic structure of the system, consisting of a four-issue pro- 
cessor and a three-level cache hierarchy, is comparable to many recent systems. 
In particular, each processor has a three-level cache hierarchy: 


■ L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for 
instruction and one for data. The block size is 32 bytes, and the data cache is write 
through to L2, using a write buffer. 


■ L2 is a 96 KB on-chip unified three-way set associative cache with a 32-byte 
block size, using write back. 


■ L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks 
also using write back. 


The latency for an access to L2 is 7 cycles, to L3 it is 21 cycles, and to main 
memory it is 80 clock cycles (typical without contention). Cache-to-cache trans- 
fers, which occur on a miss to an exclusive block held in another cache, require 
125 clock cycles. Although these miss penalties are smaller than today's higher 
clock systems would experience, the caches are also smaller, meaning that a more 
recent system would likely have lower miss rates but higher miss penalties. 
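
To see how these latencies combine, the following Python sketch computes the average number of cycles spent beyond an L1 hit. The latencies come from the text; the fractions of accesses serviced at each level are made-up inputs chosen only to exercise the arithmetic, and treating each quoted latency as the full cost of an access serviced at that level is an assumption.

```python
# Latencies in clock cycles, as quoted above for the AlphaServer 4100.
LATENCY = {"L2": 7, "L3": 21, "memory": 80, "cache_to_cache": 125}


def stall_cycles_per_access(serviced_by):
    """Average cycles beyond an L1 hit.

    serviced_by: dict mapping level -> fraction of all L1 accesses that are
    finally serviced at that level (hypothetical inputs, not measurements).
    """
    return sum(LATENCY[level] * frac for level, frac in serviced_by.items())


# Hypothetical mix: 5% of accesses go to L2, 1% to L3, 0.4% to memory,
# and 0.1% are cache-to-cache transfers.
example = stall_cycles_per_access(
    {"L2": 0.05, "L3": 0.01, "memory": 0.004, "cache_to_cache": 0.001}
)
```

With this invented mix, each memory reference costs about one extra cycle on average, and the rare memory and cache-to-cache accesses contribute nearly half of it.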

The workload used for this study consists of three applications: 


1. An online transaction-processing workload (OLTP) modeled after TPC-B 
(which has similar memory behavior to its newer cousin TPC-C) and using 
Oracle 7.3.2 as the underlying database. The workload consists of a set of cli- 
ent processes that generate requests and a set of servers that handle them. The 
server processes consume 85% of the user time, with the remaining going to 
the clients. Although the I/O latency is hidden by careful tuning and enough 
requests to keep the CPU busy, the server processes typically block for I/O 
after about 25,000 instructions. 


2. A decision support system (DSS) workload based on TPC-D and also using 
Oracle 7.3.2 as the underlying database. The workload includes only 6 of the 
17 read queries in TPC-D, although the 6 queries examined in the benchmark 
span the range of activities in the entire benchmark. To hide the I/O latency, 
parallelism is exploited both within queries, where parallelism is detected 
during a query formulation process, and across queries. Blocking calls are 
much less frequent than in the OLTP benchmark; the 6 queries average about 
15 million instructions before blocking. 


3. A Web index search (AltaVista) benchmark based on a search of a 
memory-mapped version of the AltaVista database (200 GB). The inner loop is heavily 
optimized. Because the search structure is static, little synchronization is 
needed among the threads. 


Figure 4.9 The distribution of execution time in the commercial workloads. The 
OLTP benchmark has the largest fraction of both OS time and CPU idle time (which is 
I/O wait time). The DSS benchmark shows much less OS time, since it does less I/O, 
but still more than 9% idle time. The extensive tuning of the AltaVista search engine is 
clear in these measurements. The data for this workload were collected by Barroso et 
al. [1998] on a four-processor AlphaServer 4100. 


The percentages of time spent in user mode, in the kernel, and in the idle loop 
are shown in Figure 4.9. The frequency of I/O increases both the kernel time and 
the idle time (see the OLTP entry, which has the largest I/O-to-computation 
ratio). AltaVista, which maps the entire search database into memory and has 
been extensively tuned, shows the least kernel or idle time. 


Performance Measurements of the Commercial Workload 


We start by looking at the overall CPU execution for these benchmarks on the 
four-processor system; as discussed on page 220, these benchmarks include sub- 
stantial I/O time, which is ignored in the CPU time measurements. We group the 
six DSS queries as a single benchmark, reporting the average behavior. The 
effective CPI varies widely for these benchmarks, from a CPI of 1.3 for the 
AltaVista Web search, to an average CPI of 1.6 for the DSS workload, to 7.0 for 
the OLTP workload. Figure 4.10 shows how the execution time breaks down into 
instruction execution, cache and memory system access time, and other stalls 
(which are primarily pipeline resource stalls, but also include TLB and branch 
mispredict stalls). Although the performance of the DSS and AltaVista workloads 
is reasonable, the performance of the OLTP workload is very poor, due to a poor 
performance of the memory hierarchy. 
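
Because CPU time is instruction count times CPI times clock cycle time, the CPI values translate directly into time per instruction at the 300 MHz clock rate. The short Python sketch below does that arithmetic; the function name is an illustrative choice, and the CPI values are the ones reported for Figure 4.10.

```python
CLOCK_HZ = 300e6                                   # Alpha 21164 clock rate
CPI = {"AltaVista": 1.3, "DSS": 1.6, "OLTP": 7.0}  # effective CPI per workload


def ns_per_instruction(cpi, clock_hz=CLOCK_HZ):
    """Average time per instruction in nanoseconds: CPI x clock cycle time."""
    return cpi / clock_hz * 1e9


times = {name: ns_per_instruction(cpi) for name, cpi in CPI.items()}
slowdown = times["OLTP"] / times["AltaVista"]
```

At the same clock rate, an average OLTP instruction takes more than five times as long as an average AltaVista instruction.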

Since the OLTP workload demands the most from the memory system with 
large numbers of expensive L3 misses, we focus on examining the impact of L3 
cache size, processor count, and block size on the OLTP benchmark. Figure 4.11 
shows the effect of increasing the cache size, using two-way set associative cach- 
es, which reduces the large number of conflict misses. The execution time is im- 
proved as the L3 cache grows due to the reduction in L3 misses. Surprisingly, 
Figure 4.10 The execution time breakdown for the three programs (OLTP, DSS, and 
AltaVista) in the commercial workload. The DSS numbers are the average across six 
different queries. The CPI varies widely from a low of 1.3 for AltaVista, to 1.61 for the DSS 
queries, to 7.0 for OLTP. (Individually, the DSS queries show a CPI range of 1.3 to 1.9.) 
"Other stalls" includes resource stalls (implemented with replay traps on the 21164), 
branch mispredict, memory barrier, and TLB misses. For these benchmarks, resource-based 
pipeline stalls are the dominant factor. These data combine the behavior of user 
and kernel accesses. Only OLTP has a significant fraction of kernel accesses, and the kernel 
accesses tend to be better behaved than the user accesses! All the measurements 
shown in this section were collected by Barroso, Gharachorloo, and Bugnion [1998]. 


almost all of the gain occurs in going from 1 to 2 MB, with little additional gain 
beyond that, despite the fact that cache misses are still a cause of significant per- 
formance loss with 2 MB and 4 MB caches. The question is, Why? 

To better understand the answer to this question, we need to determine what 
factors contribute to the L3 miss rate and how they change as the L3 cache grows. 
Figure 4.12 shows this data, displaying the number of memory access cycles con- 
tributed per instruction from five sources. The two largest sources of L3 memory 
access cycles with a 1 MB L3 are instruction and capacity/conflict misses. With a 
larger L3 these two sources shrink to be minor contributors. Unfortunately, the 
compulsory, false sharing, and true sharing misses are unaffected by a larger L3. 
Thus, at 4 MB and 8 MB, the true sharing misses generate the dominant fraction 
of the misses; the lack of change in true sharing misses leads to the limited reduc- 
tions in the overall miss rate when increasing the L3 cache size beyond 2 MB. 
Figure 4.11 The relative performance of the OLTP workload as the size of the L3 
cache, which is set as two-way set associative, grows from 1 MB to 8 MB. The idle time 
also grows as cache size is increased, reducing some of the performance gains. This 
growth occurs because, with fewer memory system stalls, more server processes are 
needed to cover the I/O latency. The workload could be retuned to increase the 
computation/communication balance, holding the idle time in check. 


Increasing the cache size eliminates most of the uniprocessor misses, while 
leaving the multiprocessor misses untouched. How does increasing the processor 
count affect different types of misses? Figure 4.13 shows this data assuming a 
base configuration with a 2 MB, two-way set associative L3 cache. As we might 
expect, the increase in the true sharing miss rate, which is not compensated for by 
any decrease in the uniprocessor misses, leads to an overall increase in the mem- 
ory access cycles per instruction. 

The final question we examine is whether increasing the block size—which 
should decrease the instruction and cold miss rate and, within limits, also reduce 
the capacity/conflict miss rate and possibly the true sharing miss rate—is helpful 
for this workload. Figure 4.14 shows the number of misses per 1000 instructions 
as the block size is increased from 32 to 256. Increasing the block size from 32 to 
256 affects four of the miss rate components: 


• The true sharing miss rate decreases by more than a factor of 2, indicating 
some locality in the true sharing patterns. 

Figure 4.12 The contributing causes of memory access cycles shift as the cache size 
is increased. The L3 cache is simulated as two-way set associative. 


• The compulsory miss rate significantly decreases, as we would expect. 


• The conflict/capacity misses show a small decrease (a factor of 1.26 compared 
to a factor of 8 increase in block size), indicating that the spatial locality 
is not high in the uniprocessor misses that occur with L3 caches larger 
than 2 MB. 


• The false sharing miss rate, although small in absolute terms, nearly doubles. 


The lack of a significant effect on the instruction miss rate is startling. If 
there were an instruction-only cache with this behavior, we would conclude 
that the spatial locality is very poor. In the case of a mixed L2 cache, other 
effects such as instruction-data conflicts may also contribute to the high 
instruction cache miss rate for larger blocks. Other studies have documented 
the low spatial locality in the instruction stream of large database and OLTP 
workloads, which have lots of short basic blocks and special-purpose code 
sequences. Nonetheless, increasing the block size of the third-level cache to 
128 or possibly 256 bytes seems appropriate. 
Figure 4.13 The contribution to memory access cycles increases as processor count 
increases primarily due to increased true sharing. The compulsory misses slightly 
increase since each processor must now handle more compulsory misses. 


A Multiprogramming and OS Workload 


Our next study is a multiprogrammed workload consisting of both user activity 
and OS activity. The workload used is two independent copies of the compile 
phases of the Andrew benchmark, a benchmark that emulates a software develop- 
ment environment. The compile phase consists of a parallel make using eight 
processors. The workload runs for 5.24 seconds on eight processors, creating 203 
processes and performing 787 disk requests on three different file systems. The 
workload is run with 128 MB of memory, and no paging activity takes place. 

The workload has three distinct phases: compiling the benchmarks, which 
involves substantial compute activity; installing the object files in a library; and 
removing the object files. The last phase is completely dominated by I/O and 
only two processes are active (one for each of the runs). In the middle phase, I/O 
also plays a major role and the processor is largely idle. The overall workload is 
much more system and I/O intensive than the highly tuned commercial workload. 

For the workload measurements, we assume the following memory and I/O 
systems: 
Figure 4.14 The number of misses per 1000 instructions drops steadily as the block 
size of the L3 cache is increased, making a good case for an L3 block size of at least 
128 bytes. The L3 cache is 2 MB, two-way set associative. 


• Level 1 instruction cache—32 KB, two-way set associative with a 64-byte 
block, 1 clock cycle hit time. 


• Level 1 data cache—32 KB, two-way set associative with a 32-byte block, 1 
clock cycle hit time. We vary the L1 data cache to examine its effect on cache 
behavior. 


• Level 2 cache—1 MB unified, two-way set associative with a 128-byte block, 
hit time 10 clock cycles. 


• Main memory—Single memory on a bus with an access time of 100 clock 
cycles. 


• Disk system—Fixed-access latency of 3 ms (less than normal to reduce idle 
time). 


Figure 4.15 shows how the execution time breaks down for the eight pro- 
cessors using the parameters just listed. Execution time is broken into four 
components: 


1. Idle—Execution in the kernel mode idle loop 

2. User—Execution in user code 

3. Synchronization—Execution or waiting for synchronization variables 

4. Kernel—Execution in the OS that is neither idle nor in synchronization 
access 


                          User        Kernel      Synchronization   CPU idle
                          execution   execution   wait              (waiting for I/O)
% instructions executed   27          3           1                 69
% execution time          27          7           2                 64


Figure 4.15 The distribution of execution time in the multiprogrammed parallel 
make workload. The high fraction of idle time is due to disk latency when only one of 
the eight processors is active. These data and the subsequent measurements for this 
workload were collected with the SimOS system [Rosenblum et al. 1995]. The actual 
runs and data collection were done by M. Rosenblum, S. Herrod, and E. Bugnion of 
Stanford University. 


This multiprogramming workload has a significant instruction cache perfor- 
mance loss, at least for the OS. The instruction cache miss rate in the OS for a 64- 
byte block size, two-way set-associative cache varies from 1.7% for a 32 KB 
cache to 0.2% for a 256 KB cache. User-level instruction cache misses are 
roughly one-sixth of the OS rate, across the variety of cache sizes. This partially 
accounts for the fact that although the user code executes nine times as many 
instructions as the kernel, those instructions take only about four times as long as 
the smaller number of instructions executed by the kernel. 
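
The ratios quoted above follow directly from the user and kernel columns of Figure 4.15. The short sketch below reproduces the arithmetic; the variable names are illustrative and not part of the original study.

```python
# Percentages for user and kernel execution, taken from Figure 4.15.
instructions_pct = {"user": 27, "kernel": 3}
time_pct = {"user": 27, "kernel": 7}

# How many times more instructions user code executes than the kernel.
instr_ratio = instructions_pct["user"] / instructions_pct["kernel"]

# How many times longer user code runs than the kernel.
time_ratio = time_pct["user"] / time_pct["kernel"]

# Average time per kernel instruction relative to a user instruction.
kernel_cost_per_instr = instr_ratio / time_ratio

print(f"instruction ratio (user/kernel): {instr_ratio:.1f}")
print(f"time ratio (user/kernel):        {time_ratio:.2f}")
print(f"relative cost per kernel instruction: {kernel_cost_per_instr:.2f}")
```

Under these figures a kernel instruction takes a little over twice as long, on average, as a user instruction, which is consistent with the higher kernel miss rates described in the text.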


Performance of the Multiprogramming and OS Workload 


In this subsection we examine the cache performance of the multiprogrammed 
workload as the cache size and block size are changed. Because of differences 
between the behavior of the kernel and that of the user processes, we keep these 
two components separate. Remember, though, that the user processes execute 
more than eight times as many instructions, so that the overall miss rate is deter- 
mined primarily by the miss rate in user code, which, as we will see, is often one- 
fifth of the kernel miss rate. 

Although the user code executes more instructions, the behavior of the oper- 
ating system can cause more cache misses than the user processes for two reasons 
beyond larger code size and lack of locality. First, the kernel initializes all pages 
before allocating them to a user, which significantly increases the compulsory 
component of the kernel's miss rate. Second, the kernel actually shares data and 
thus has a nontrivial coherence miss rate. In contrast, user processes cause coher- 
ence misses only when the process is scheduled on a different processor, and this 
component of the miss rate is small. 
Figure 4.16 shows the data miss rate versus data cache size and versus block 
size for the kernel and user components. Increasing the data cache size affects the 
user miss rate more than it affects the kernel miss rate. Increasing the block size 
has beneficial effects for both miss rates, since a larger fraction of the misses 
arise from compulsory and capacity misses, both of which can potentially be reduced 
with larger block sizes. Since coherence misses are relatively rarer, the negative 
effects of increasing block size are small. To understand why the kernel and user 
processes behave differently, we can look at how the kernel misses behave. 

Figure 4.17 shows the variation in the kernel misses versus increases in cache 
size and in block size. The misses are broken into three classes: compulsory 
misses, coherence misses (from both true and false sharing), and capacity/conflict 
misses (which include misses caused by interference between the OS and the 
user process and between multiple user processes). Figure 4.17 confirms that, for 
the kernel references, increasing the cache size reduces solely the uniprocessor 
capacity/conflict miss rate. In contrast, increasing the block size causes a reduc- 
tion in the compulsory miss rate. The absence of large increases in the coherence 
miss rate as block size is increased means that false sharing effects are probably 
insignificant, although such misses may be offsetting some of the gains from 
reducing the true sharing misses. 

If we examine the number of bytes needed per data reference, as in Figure 
4.18, we see that the kernel has a higher traffic ratio that grows with block size. It 
Figure 4.16 The data miss rates for the user and kernel components behave differently for increases in the L1 
data cache size (on the left) versus increases in the L1 data cache block size (on the right). Increasing the L1 data 
cache from 32 KB to 256 KB (with a 32-byte block) causes the user miss rate to decrease proportionately more than 
the kernel miss rate: the user-level miss rate drops by almost a factor of 3, while the kernel-level miss rate drops only 
by a factor of 1.3. The miss rate for both user and kernel components drops steadily as the L1 block size is increased 
(while keeping the L1 cache at 32 KB). In contrast to the effects of increasing the cache size, increasing the block size 
improves the kernel miss rate more significantly (just under a factor of 4 for the kernel references when going from 
16-byte to 128-byte blocks versus just under a factor of 3 for the user references). 
Figure 4.17 The components of the kernel data miss rate change as the L1 data 
cache size is increased from 32 KB to 256 KB, when the multiprogramming workload 
is run on eight processors. The compulsory miss rate component stays constant, since 
it is unaffected by cache size. The capacity component drops by more than a factor of 2, 
while the coherence component nearly doubles. The increase in coherence misses 
occurs because the probability of a miss being caused by an invalidation increases with 
cache size, since fewer entries are bumped due to capacity. As we would expect, the 
increasing block size of the L1 data cache substantially reduces the compulsory miss 
rate in the kernel references. It also has a significant impact on the capacity miss rate, 
decreasing it by a factor of 2.4 over the range of block sizes. The increased block size yields 
a small reduction in coherence traffic, which appears to stabilize at 64 bytes, with no 
change in the coherence miss rate in going to 128-byte lines. Because there are no 
significant reductions in the coherence miss rate as the block size increases, the fraction of 
the miss rate due to coherence grows from about 7% to about 15%. 
Figure 4.18 The number of bytes needed per data reference grows as block size is 
increased for both the kernel and user components. It is interesting to compare this 
chart against the data on scientific programs shown in Appendix H. 
is easy to see why this occurs: when going from a 16-byte block to a 128-byte 
block, the miss rate drops by a factor of about 3.7, but the number of bytes transferred per 
miss increases by a factor of 8, so the total miss traffic increases by just over a factor of 2. 
The traffic for the user program also more than doubles as the block size goes from 16 to 128 
bytes, but it starts out at a much lower level. 
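
The kernel traffic estimate is a one-line calculation: traffic per reference is the miss rate times the bytes moved per miss. A minimal sketch using only the two factors quoted in the text (the variable names are illustrative):

```python
# Factors quoted in the text for the kernel, going from 16-byte to 128-byte blocks.
miss_rate_reduction = 3.7      # miss rate drops by about this factor
block_growth = 128 / 16        # bytes transferred per miss grow by this factor

# Traffic per reference = miss rate x bytes per miss, so the growth in traffic
# is the block growth divided by the miss-rate reduction.
traffic_growth = block_growth / miss_rate_reduction

print(f"bytes per miss grow by {block_growth:.0f}x")
print(f"miss traffic grows by about {traffic_growth:.2f}x")
```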

For the multiprogrammed workload, the OS is a much more demanding user 
of the memory system. If more OS or OS-like activity is included in the work- 
load, and the behavior is similar to what was measured for this workload, it will 
become very difficult to build a sufficiently capable memory system. One possi- 
ble route to improving performance is to make the OS more cache aware, through 
either better programming environments or through programmer assistance. For 
example, the OS reuses memory for requests that arise from different system 
calls. Despite the fact that the reused memory will be completely overwritten, the 
hardware, not recognizing this, will attempt to preserve coherency and the possi- 
bility that some portion of a cache block may be read, even if it is not. This 
behavior is analogous to the reuse of stack locations on procedure invocations. 
The IBM Power series has support to allow the compiler to indicate this type of 
behavior on procedure invocations. It is harder to detect such behavior by the OS, 
and doing so may require programmer assistance, but the payoff is potentially 
even greater. 


Distributed Shared Memory and Directory-Based 
Coherence 


As we saw in Section 4.2, a snooping protocol requires communication with all 
caches on every cache miss, including writes of potentially shared data. The 
absence of any centralized data structure that tracks the state of the caches is both 
the fundamental advantage of a snooping-based scheme, since it allows it to be 
inexpensive, as well as its Achilles’ heel when it comes to scalability. 

For example, with only 16 processors, a block size of 64 bytes, and a 512 KB 
data cache, the total bus bandwidth demand (ignoring stall cycles) for the four 
programs in the scientific/technical workload of Appendix H ranges from about 
4 GB/sec to about 170 GB/sec. This range assumes a processor that sustains one data 
reference per clock, which for a 4 GHz clock is four data references per ns, roughly 
what a 2006 superscalar processor with nonblocking caches might generate. In 
comparison, the memory bandwidth of the highest-performance centralized 
shared-memory 16-way multiprocessor in 2006 was 2.4 GB/sec per processor. In 
2006, multiprocessors with a distributed-memory model were available with over 
12 GB/sec per processor to the nearest memory. 
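
As a rough back-of-the-envelope check on these figures, the sketch below assumes a simple model that is not taken from the text: bus traffic equals processors × data references per second × miss rate × block size, ignoring write-backs and coherence messages. Under that assumption it back-solves the miss rates that the 4 GB/sec and 170 GB/sec endpoints would imply. The function names are hypothetical.

```python
def bus_bandwidth(processors, refs_per_sec, miss_rate, block_bytes):
    """Bytes per second placed on the bus under the simplified model
    (every miss moves exactly one block; nothing else uses the bus)."""
    return processors * refs_per_sec * miss_rate * block_bytes


def implied_miss_rate(bandwidth, processors, refs_per_sec, block_bytes):
    """Miss rate that would produce the given bandwidth under the same model."""
    return bandwidth / (processors * refs_per_sec * block_bytes)


# Parameters quoted in the text.
PROCESSORS = 16
REFS_PER_SEC = 4e9      # one data reference per clock at 4 GHz
BLOCK_BYTES = 64

for gb_per_sec in (4, 170):
    rate = implied_miss_rate(gb_per_sec * 1e9, PROCESSORS, REFS_PER_SEC, BLOCK_BYTES)
    print(f"{gb_per_sec:>3} GB/sec implies a miss rate of about {rate:.2%}")
```

The point of the exercise is only that miss rates of a few percent, multiplied across 16 fast processors, already exceed what a single shared bus of that era could deliver.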

We can increase the memory bandwidth and interconnection bandwidth by 
distributing the memory, as shown in Figure 4.2 on page 201; this immediately 
separates local memory traffic from remote memory traffic, reducing the band- 
width demands on the memory system and on the interconnection network. 
Unless we eliminate the need for the coherence protocol to broadcast on every 
cache miss, distributing the memory will gain us little. 
As we mentioned earlier, the alternative to a snoop-based coherence protocol 
is a directory protocol. A directory keeps the state of every block that may be 
cached. Information in the directory includes which caches have copies of the 
block, whether it is dirty, and so on. A directory protocol also can be used to 
reduce the bandwidth demands in a centralized shared-memory machine, as the 
Sun T1 design does (see Section 4.8). We explain a directory protocol as if it 
were implemented with a distributed memory, but the same design also applies to 
a centralized memory organized into banks. 

The simplest directory implementations associate an entry in the directory 
with each memory block. In such implementations, the amount of information is 
proportional to the product of the number of memory blocks (where each block is 
the same size as the level 2 or level 3 cache block) and the number of processors. 
This overhead is not a problem for multiprocessors with fewer than about 200 
processors because the directory overhead with a reasonable block size will be 
tolerable. For larger multiprocessors, we need methods to allow the directory 
structure to be efficiently scaled. The methods that have been used either try to 
keep information for fewer blocks (e.g., only those in caches rather than all mem- 
ory blocks) or try to keep fewer bits per entry by using individual bits to stand for 
a small collection of processors. 
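
To see how the overhead scales, the sketch below computes directory storage as a fraction of memory for a full bit vector (one presence bit per processor per block) and for a coarse vector in which each bit stands for a group of processors. It ignores the few state bits per entry, and the function names and example sizes are assumptions chosen for illustration.

```python
import math


def full_bit_vector_overhead(processors, block_bytes):
    """Directory bits per bit of memory with one presence bit per processor."""
    return processors / (block_bytes * 8)


def coarse_vector_overhead(processors, block_bytes, group_size):
    """Same ratio when each directory bit covers `group_size` processors."""
    bits_per_entry = math.ceil(processors / group_size)
    return bits_per_entry / (block_bytes * 8)


for p, b in [(64, 64), (200, 128), (1024, 128)]:
    print(f"{p:>5} processors, {b}-byte blocks: "
          f"{full_bit_vector_overhead(p, b):.1%} overhead")

print(f" 1024 processors, 128-byte blocks, groups of 8: "
      f"{coarse_vector_overhead(1024, 128, 8):.1%} overhead")
```

Because the entry width grows linearly with processor count while the block size stays fixed, the full bit vector eventually costs as much as the memory it describes, which is why the coarser encodings are used at large scale.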

To prevent the directory from becoming the bottleneck, the directory is dis- 
tributed along with the memory (or with the interleaved memory banks in an 
SMP), so that different directory accesses can go to different directories, just as 
different memory requests go to different memories. A distributed directory 
retains the characteristic that the sharing status of a block is always in a single 
known location. This property is what allows the coherence protocol to avoid 
broadcast. Figure 4.19 shows how our distributed-memory multiprocessor looks 
with the directories added to each node. 


Directory-Based Cache Coherence Protocols: The Basics 


Just as with a snooping protocol, there are two primary operations that a directory 
protocol must implement: handling a read miss and handling a write to a shared, 
clean cache block. (Handling a write miss to a block that is currently shared is a 
simple combination of these two.) To implement these operations, a directory 
must track the state of each cache block. In a simple protocol, these states could 
be the following: 


• Shared—One or more processors have the block cached, and the value in 
memory is up to date (as well as in all the caches). 

• Uncached—No processor has a copy of the cache block. 

• Modified—Exactly one processor has a copy of the cache block, and it has 
written the block, so the memory copy is out of date. The processor is called 
the owner of the block. 

Figure 4.19 A directory is added to each node to implement cache coherence in a 
distributed-memory multiprocessor. Each directory is responsible for tracking the 
caches that share the memory addresses of the portion of memory in the node. The 
directory may communicate with the processor and memory over a common bus, as 
shown, or it may have a separate port to memory, or it may be part of a central node 
controller through which all intranode and internode communications pass. 


In addition to tracking the state of each potentially shared memory block, we 
must track which processors have copies of that block, since those copies will 
need to be invalidated on a write. The simplest way to do this is to keep a bit vec- 
tor for each memory block. When the block is shared, each bit of the vector indi- 
cates whether the corresponding processor has a copy of that block. We can also 
use the bit vector to keep track of the owner of the block when the block is in the 
exclusive state. For efficiency reasons, we also track the state of each cache block 
at the individual caches. 
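
A minimal sketch of such a per-block bit vector follows. The class and method names are hypothetical; the only idea taken from the text is that bit i records whether processor i holds a copy, and that the same vector can identify the single owner when the block is exclusive.

```python
class SharerVector:
    """Presence bits for one memory block; bit i is set if processor i has a copy."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.bits = 0

    def add(self, proc):
        self.bits |= 1 << proc

    def remove(self, proc):
        self.bits &= ~(1 << proc)

    def contains(self, proc):
        return bool((self.bits >> proc) & 1)

    def members(self):
        """Processors to invalidate on a write, in ascending order."""
        return [p for p in range(self.num_processors) if (self.bits >> p) & 1]

    def set_owner(self, proc):
        """Exclusive state: exactly one bit set, naming the owner."""
        self.bits = 1 << proc


vec = SharerVector(num_processors=8)
vec.add(1)
vec.add(5)
print(vec.members())   # processors currently sharing the block
vec.set_owner(3)
print(vec.members())   # only the owner remains
```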

The states and transitions for the state machine at each cache are identical to 
what we used for the snooping cache, although the actions on a transition are 
slightly different. The processes of invalidating and locating an exclusive copy of a 
data item are different, since they both involve communication between the 
requesting node and the directory and between the directory and one or more 
remote nodes. In a snooping protocol, these two steps are combined through the 
use of a broadcast to all nodes. 

Before we see the protocol state diagrams, it is useful to examine a catalog of 
the message types that may be sent between the processors and the directories for 
the purpose of handling misses and maintaining coherence. Figure 4.20 shows 
the type of messages sent among nodes. The local node is the node where a 
request originates. The home node is the node where the memory location and the 
directory entry of an address reside. The physical address space is statically dis- 
tributed, so the node that contains the memory and directory for a given physical 
address is known. For example, the high-order bits may provide the node number, 
Message type      Source          Destination     Message   Function of this message
                                                  contents
Read miss         local cache     home directory  P, A      Processor P has a read miss at address A;
                                                            request data and make P a read sharer.
Write miss        local cache     home directory  P, A      Processor P has a write miss at address A;
                                                            request data and make P the exclusive owner.
Invalidate        local cache     home directory  A         Request to send invalidates to all remote caches
                                                            that are caching the block at address A.
Invalidate        home directory  remote cache    A         Invalidate a shared copy of data at address A.
Fetch             home directory  remote cache    A         Fetch the block at address A and send it to its
                                                            home directory; change the state of A in the
                                                            remote cache to shared.
Fetch/invalidate  home directory  remote cache    A         Fetch the block at address A and send it to its
                                                            home directory; invalidate the block in the
                                                            cache.
Data value reply  home directory  local cache     D         Return a data value from the home memory.
Data write back   remote cache    home directory  A, D      Write back a data value for address A.


Figure 4.20 The possible messages sent among nodes to maintain coherence, along with the source and 
destination node, the contents (where P = requesting processor number, A = requested address, and D = data 
contents), and the function of the message. The first three messages are requests sent by the local cache to the home. 
The fourth through sixth messages are messages sent to a remote cache by the home when the home needs the 
data to satisfy a read or write miss request. Data value replies are used to send a value from the home node back to 
the requesting node. Data value write backs occur for two reasons: when a block is replaced in a cache and must be 
written back to its home memory, and also in reply to fetch or fetch/invalidate messages from the home. Writing 
back the data value whenever the block becomes shared simplifies the number of states in the protocol, since any 
dirty block must be exclusive and any shared block is always available in the home memory. 


while the low-order bits provide the offset within the memory on that node. The 
local node may also be the home node. The directory must be accessed when the 
home node is the local node, since copies may exist in yet a third node, called a 
remote node. 
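
The static mapping from physical address to home node can be sketched as a simple bit split. The address width and number of node bits below are assumptions chosen for illustration; the text says only that the high-order bits may give the node number and the low-order bits the offset.

```python
def split_address(addr, addr_bits, node_bits):
    """Return (home_node, offset) for a physical address, taking the top
    `node_bits` bits as the node number and the rest as the offset."""
    offset_bits = addr_bits - node_bits
    home_node = addr >> offset_bits
    offset = addr & ((1 << offset_bits) - 1)
    return home_node, offset


# Hypothetical configuration: 32-bit physical addresses, 16 nodes (4 node bits).
node, offset = split_address(0x3000_0040, addr_bits=32, node_bits=4)
print(f"home node = {node}, offset = {offset:#x}")
```

Because the split is fixed, any node can compute the home of an address locally and send its request straight to that directory without a broadcast.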

A remote node is the node that has a copy of a cache block, whether exclusive 
(in which case it is the only copy) or shared. A remote node may be the same as 
either the local node or the home node. In such cases, the basic protocol does not 
change, but interprocessor messages may be replaced with intraprocessor 
messages. 

In this section, we assume a simple model of memory consistency. To mini- 
mize the type of messages and the complexity of the protocol, we make an 
assumption that messages will be received and acted upon in the same order they 
are sent. This assumption may not be true in practice and can result in additional 
complications, some of which we address in Section 4.6 when we discuss mem- 
ory consistency models. In this section, we use this assumption to ensure that 
invalidates sent by a processor are honored before new messages are transmitted, 
just as we assumed in the discussion of implementing snooping protocols. As we 
did in the snooping case, we omit some details necessary to implement the coherence 
protocol. In particular, the serialization of writes and knowing that the inval- 
idates for a write have completed are not as simple as in the broadcast-based 
snooping mechanism. Instead, explicit acknowledgements are required in 
response to write misses and invalidate requests. We discuss these issues in more 
detail in Appendix H. 


An Example Directory Protocol 


The basic states of a cache block in a directory-based protocol are exactly like 
those in a snooping protocol, and the states in the directory are also analogous to 
those we showed earlier. Thus we can start with simple state diagrams that show 
the state transitions for an individual cache block and then examine the state dia- 
gram for the directory entry corresponding to each block in memory. As in the 
snooping case, these state transition diagrams do not represent all the details of a 
coherence protocol; however, the actual controller is highly dependent on a num- 
ber of details of the multiprocessor (message delivery properties, buffering struc- 
tures, and so on). In this section we present the basic protocol state diagrams. The 
knotty issues involved in implementing these state transition diagrams are exam- 
ined in Appendix H. 

Figure 4.21 shows the protocol actions to which an individual cache 
responds. We use the same notation as in the last section, with requests coming 
from outside the node in gray and actions in bold. The state transitions for an 
individual cache are caused by read misses, write misses, invalidates, and data 
fetch requests; these operations are all shown in Figure 4.21. An individual cache 
also generates read miss, write miss, and invalidate messages that are sent to the 
home directory. Read and write misses require data value replies, and these 
events wait for replies before changing state. Knowing when invalidates com- 
plete is a separate problem and is handled separately. 

The operation of the state transition diagram for a cache block in Figure 4.21 
is essentially the same as it is for the snooping case: The states are identical, and 
the stimulus is almost identical. The write miss operation, which was broadcast 
on the bus (or other network) in the snooping scheme, is replaced by the data 
fetch and invalidate operations that are selectively sent by the directory control- 
ler. Like the snooping protocol, any cache block must be in the exclusive state 
when it is written, and any shared block must be up to date in memory. 

In a directory-based protocol, the directory implements the other half of the 
coherence protocol. A message sent to a directory causes two different types of 
actions: updating the directory state and sending additional messages to satisfy 
the request. The states in the directory represent the three standard states for a 
block; unlike in a snoopy scheme, however, the directory state indicates the state 
of all the cached copies of a memory block, rather than for a single cache block. 

The memory block may be uncached by any node, cached in multiple nodes 
and readable (shared), or cached exclusively and writable in exactly one node. In 
addition to the state of each block, the directory must track the set of processors 
Figure 4.21 State transition diagram for an individual cache block in a directory-based 
system. Requests by the local processor are shown in black, and those from the 
home directory are shown in gray. The states are identical to those in the snooping 
case, and the transactions are very similar, with explicit invalidate and write-back 
requests replacing the write misses that were formerly broadcast on the bus. As we did 
for the snooping controller, we assume that an attempt to write a shared cache block is 
treated as a miss; in practice, such a transaction can be treated as an ownership request 
or upgrade request and can deliver ownership without requiring that the cache block 
be fetched. 


that have a copy of a block; we use a set called Sharers to perform this function. 
In multiprocessors with fewer than 64 nodes (each of which may represent two to 
four times as many processors), this set is typically kept as a bit vector. In larger 
multiprocessors, other techniques are needed. Directory requests need to update 
the set Sharers and also read the set to perform invalidations. 

Figure 4.22 shows the actions taken at the directory in response to messages 
received. The directory receives three different requests: read miss, write miss, 
and data write back. The messages sent in response by the directory are shown in 
bold, while the updating of the set Sharers is shown in bold italics. Because all 
the stimulus messages are external, all actions are shown in gray. Our simplified 
protocol assumes that some actions are atomic, such as requesting a value and 
sending it to another node; a realistic implementation cannot use this assumption. 
Figure 4.22 The state transition diagram for the directory has the same states and 
structure as the transition diagram for an individual cache. All actions are in gray 
because they are all externally caused. Bold indicates the action taken by the directory 
in response to the request. 


To understand these directory operations, let's examine the requests received 
and actions taken state by state. When a block is in the uncached state, the copy 
in memory is the current value, so the only possible requests for that block are 


• Read miss: The requesting processor is sent the requested data from memory,
  and the requestor is made the only sharing node. The state of the block is
  made shared.

• Write miss: The requesting processor is sent the value and becomes the
  sharing node. The block is made exclusive to indicate that the only valid
  copy is cached. Sharers indicates the identity of the owner.

When the block is in the shared state, the memory value is up to date, so the same
two requests can occur:

• Read miss: The requesting processor is sent the requested data from memory,
  and the requesting processor is added to the sharing set.

• Write miss: The requesting processor is sent the value. All processors in the
  set Sharers are sent invalidate messages, and the Sharers set is updated to
  contain only the identity of the requesting processor. The state of the block
  is made exclusive.

When the block is in the exclusive state, the current value of the block is held in
the cache of the processor identified by the set Sharers (the owner), so there are
three possible directory requests:

• Read miss: The owner processor is sent a data fetch message, which causes
  the state of the block in the owner's cache to transition to shared and causes
  the owner to send the data to the directory, where it is written to memory and
  sent back to the requesting processor. The identity of the requesting processor
  is added to the set Sharers, which still contains the identity of the processor
  that was the owner (since it still has a readable copy).

• Data write back: The owner processor is replacing the block and therefore
  must write it back. This write back makes the memory copy up to date (the
  home directory essentially becomes the owner), the block is now uncached,
  and the Sharers set is empty.

• Write miss: The block has a new owner. A message is sent to the old owner,
  causing the cache to invalidate the block and send the value to the directory,
  from which it is sent to the requesting processor, which becomes the new
  owner. Sharers is set to the identity of the new owner, and the state of the
  block remains exclusive.
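
The state-by-state actions above can be collected into a small model of one directory entry. This is a sketch under the same simplifying assumption of atomic transactions; the class name, the message labels (`fetch`, `fetch_invalidate`, `invalidate`, `data_value_reply`), and the choice not to send an invalidate to a requester that is already a sharer are illustrative assumptions.

```python
UNCACHED, SHARED, EXCLUSIVE = "uncached", "shared", "exclusive"


class DirectoryEntry:
    """Directory state for one memory block (simplified, atomic transactions)."""

    def __init__(self):
        self.state = UNCACHED
        self.sharers = set()

    def read_miss(self, requester):
        msgs = []
        if self.state == EXCLUSIVE:
            (owner,) = self.sharers
            msgs.append(("fetch", owner))  # owner writes back, keeps a shared copy
        self.sharers.add(requester)
        self.state = SHARED
        msgs.append(("data_value_reply", requester))
        return msgs

    def write_miss(self, requester):
        msgs = []
        if self.state == SHARED:
            # Assumption: the requester itself needs no invalidate message.
            others = sorted(self.sharers - {requester})
            msgs += [("invalidate", node) for node in others]
        elif self.state == EXCLUSIVE:
            (owner,) = self.sharers
            msgs.append(("fetch_invalidate", owner))
        self.sharers = {requester}  # requester is the new owner
        self.state = EXCLUSIVE
        msgs.append(("data_value_reply", requester))
        return msgs

    def data_write_back(self, owner):
        if self.state != EXCLUSIVE or self.sharers != {owner}:
            raise ValueError("write back from a node that is not the owner")
        self.sharers = set()
        self.state = UNCACHED
        return []


entry = DirectoryEntry()
entry.read_miss(0)                 # uncached -> shared, Sharers = {0}
entry.read_miss(1)                 # shared, Sharers = {0, 1}
write_msgs = entry.write_miss(2)   # invalidate 0 and 1; node 2 becomes owner
state_after_write = (entry.state, set(entry.sharers))
read_msgs = entry.read_miss(1)     # fetch from owner 2; both keep a copy
```

Each method returns the messages the directory would send, so a sequence of requests can be traced against the transitions listed above.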


This state transition diagram in Figure 4.22 is a simplification, just as it was 
in the snooping cache case. In the case of a directory, as well as a snooping 
scheme implemented with a network other than a bus, our protocols will need 
to deal with nonatomic memory transactions. Appendix H explores these issues 
in depth. 

The directory protocols used in real multiprocessors contain additional opti- 
mizations. In particular, in this protocol when a read or write miss occurs for a 
block that is exclusive, the block is first sent to the directory at the home node. 
From there it is stored into the home memory and also sent to the original 
requesting node. Many of the protocols in use in commercial multiprocessors for- 
ward the data from the owner node to the requesting node directly (as well as per- 
forming the write back to the home). Such optimizations often add complexity by 
increasing the possibility of deadlock and by increasing the types of messages 
that must be handled. 

Implementing a directory scheme requires solving most of the same chal- 
lenges we discussed for snoopy protocols beginning on page 217. There are, 
however, new and additional problems, which we describe in Appendix H. 


Synchronization: The Basics 


Synchronization mechanisms are typically built with user-level software routines
that rely on hardware-supplied synchronization instructions. For smaller
multiprocessors or low-contention situations, the key hardware capability is an
uninterruptible instruction or instruction sequence capable of atomically
retrieving and changing a value. Software synchronization mechanisms are then
constructed using this capability. In this section we focus on the implementation
of lock and unlock synchronization operations. Lock and unlock can be used
straightforwardly to create mutual exclusion, as well as to implement more
complex synchronization mechanisms.

In larger-scale multiprocessors or high-contention situations, synchronization 
can become a performance bottleneck because contention introduces additional 
delays and because latency is potentially greater in such a multiprocessor. We dis- 
cuss how the basic synchronization mechanisms of this section can be extended 
for large processor counts in Appendix H. 


Basic Hardware Primitives 


The key ability we require to implement synchronization in a multiprocessor is a 
set of hardware primitives with the ability to atomically read and modify a mem- 
ory location. Without such a capability, the cost of building basic synchronization 
primitives will be too high and will increase as the processor count increases. 
There are a number of alternative formulations of the basic hardware primitives, 
all of which provide the ability to atomically read and modify a location, together 
with some way to tell if the read and write were performed atomically. These 
hardware primitives are the basic building blocks that are used to build a wide 
variety of user-level synchronization operations, including things such as locks 
and barriers. In general, architects do not expect users to employ the basic hard- 
ware primitives, but instead expect that the primitives will be used by system pro- 
grammers to build a synchronization library, a process that is often complex and 
tricky. Let's start with one such hardware primitive and show how it can be used 
to build some basic synchronization operations. 

One typical operation for building synchronization operations is the atomic 
exchange, which interchanges a value in a register for a value in memory. To see 
how to use this to build a basic synchronization operation, assume that we want 
to build a simple lock where the value 0 is used to indicate that the lock is free
and 1 is used to indicate that the lock is unavailable. A processor tries to set the
lock by doing an exchange of 1, which is in a register, with the memory address 
corresponding to the lock. The value returned from the exchange instruction is 1 
if some other processor had already claimed access and 0 otherwise. In the latter 
case, the value is also changed to 1, preventing any competing exchange from 
also retrieving a 0. 

For example, consider two processors that each try to do the exchange simul- 
taneously: This race is broken since exactly one of the processors will perform 
the exchange first, returning 0, and the second processor will return 1 when it 
does the exchange. The key to using the exchange (or swap) primitive to imple- 
ment synchronization is that the operation is atomic: The exchange is indivisible, 
and two simultaneous exchanges will be ordered by the write serialization mech- 
anisms. It is impossible for two processors trying to set the synchronization vari- 
able in this manner to both think they have simultaneously set the variable. 
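
The race between two exchanges can be imitated in software. In this sketch a Python `threading.Lock` stands in for the hardware guarantee that the exchange is indivisible; the class `MemoryWord` and the function `try_lock` are hypothetical names, and threads play the role of processors.

```python
import threading


class MemoryWord:
    """One memory location supporting atomic exchange.

    A Python lock models the hardware guarantee that the read and the write
    of an exchange are indivisible.
    """

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()

    def exchange(self, new_value):
        with self._atomic:
            old = self._value
            self._value = new_value
            return old


lock_var = MemoryWord(0)   # 0 = free, 1 = unavailable
results = []               # list.append is thread-safe in CPython


def try_lock():
    results.append(lock_var.exchange(1))


threads = [threading.Thread(target=try_lock) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1]: exactly one processor saw the lock free
```

Whichever thread runs first, one exchange returns 0 and the other returns 1, which is the property the lock depends on.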




There are a number of other atomic primitives that can be used to implement 
synchronization. They all have the key property that they read and update a mem- 
ory value in such a manner that we can tell whether or not the two operations 
executed atomically. One operation, present in many older multiprocessors, is 
test-and-set, which tests a value and sets it if the value passes the test. For example,
we could define an operation that tested for 0 and set the value to 1, which
can be used in a fashion similar to how we used atomic exchange. Another atomic 
synchronization primitive is fetch-and-increment: It returns the value of a mem- 
ory location and atomically increments it. By using the value 0 to indicate that 
the synchronization variable is unclaimed, we can use fetch-and-increment, just 
as we used exchange. There are other uses of operations like fetch-and- 
increment, which we will see shortly. 
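
The two primitives just described can be modeled in the same way. This is a sketch only: the class `AtomicWord` is a hypothetical name, and a Python lock again models the indivisibility that hardware would provide.

```python
import threading


class AtomicWord:
    """A memory word with two atomic read-modify-write operations."""

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()  # models hardware indivisibility

    def test_and_set(self):
        """If the value is 0, set it to 1. Return the value the test saw."""
        with self._atomic:
            old = self._value
            if old == 0:
                self._value = 1
            return old

    def fetch_and_increment(self):
        """Return the current value and increment the location."""
        with self._atomic:
            old = self._value
            self._value = old + 1
            return old


flag = AtomicWord()
first, second = flag.test_and_set(), flag.test_and_set()   # 0, then 1

counter = AtomicWord()
tickets = [counter.fetch_and_increment() for _ in range(3)]  # 0, 1, 2
```

With either primitive, only the caller that gets back 0 has claimed an unclaimed synchronization variable.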

Implementing a single atomic memory operation introduces some challenges, 
since it requires both a memory read and a write in a single, uninterruptible 
instruction. This requirement complicates the implementation of coherence, since 
the hardware cannot allow any other operations between the read and the write, 
and yet must not deadlock. 

An alternative is to have a pair of instructions where the second instruction 
returns a value from which it can be deduced whether the pair of instructions was 
executed as if the instructions were atomic. The pair of instructions is effectively 
atomic if it appears as if all other operations executed by any processor occurred 
before or after the pair. Thus, when an instruction pair is effectively atomic, no 
other processor can change the value between the instruction pair. 

The pair of instructions includes a special load called a load linked or load 
locked and a special store called a store conditional. These instructions are used 
in sequence: If the contents of the memory location specified by the load linked 
are changed before the store conditional to the same address occurs, then the 
store conditional fails. If the processor does a context switch between the two 
instructions, then the store conditional also fails. The store conditional is defined 
to return 1 if it was successful and a 0 otherwise. Since the load linked returns the 
initial value and the store conditional returns 1 only if it succeeds, the following 
sequence implements an atomic exchange on the memory location specified by 
the contents of R1: 


try:    MOV     R3,R4        ;mov exchange value
        LL      R2,0(R1)     ;load linked
        SC      R3,0(R1)     ;store conditional
        BEQZ    R3,try       ;branch store fails
        MOV     R4,R2        ;put load value in R4


At the end of this sequence the contents of R4 and the memory location specified
by R1 have been atomically exchanged (ignoring any effect from delayed
branches). Any time a processor intervenes and modifies the value in memory
between the LL and SC instructions, the SC returns 0 in R3, causing the code
sequence to try again.

An advantage of the load linked/store conditional mechanism is that it can be
used to build other synchronization primitives. For example, here is an atomic 
fetch-and-increment: 


try:    LL      R2,0(R1)     ;load linked
        DADDUI  R3,R2,#1     ;increment
        SC      R3,0(R1)     ;store conditional
        BEQZ    R3,try       ;branch store fails


These instructions are typically implemented by keeping track of the address 
specified in the LL instruction in a register, often called the link register. If an 
interrupt occurs, or if the cache block matching the address in the link register is 
invalidated (for example, by another SC), the link register is cleared. The SC 
instruction simply checks that its address matches that in the link register. If so, 
the SC succeeds; otherwise, it fails. Since the store conditional will fail after 
either another attempted store to the load linked address or any exception, care 
must be taken in choosing what instructions are inserted between the two instruc- 
tions. In particular, only register-register instructions can safely be permitted; 
otherwise, it is possible to create deadlock situations where the processor can 
never complete the SC. In addition, the number of instructions between the load 
linked and the store conditional should be small to minimize the probability that 
either an unrelated event or a competing processor causes the store conditional to 
fail frequently. 
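
The link register behavior can be illustrated with a small software model. This is a sketch under simplifying assumptions: the link is tracked per address rather than per cache block, memory writes notify other processors directly instead of going through a coherence protocol, and the names `Memory`, `Processor`, and `fetch_and_increment` are hypothetical.

```python
class Memory:
    """Shared memory that tells other processors about writes (invalidations)."""

    def __init__(self):
        self.cells = {}
        self.processors = []

    def write(self, addr, value, writer):
        self.cells[addr] = value
        for proc in self.processors:
            if proc is not writer:
                proc.invalidate(addr)


class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.link = None            # link register: address of the last LL
        memory.processors.append(self)

    def load_linked(self, addr):
        self.link = addr
        return self.memory.cells.get(addr, 0)

    def invalidate(self, addr):
        if self.link == addr:
            self.link = None        # another processor wrote the linked address

    def interrupt(self):
        self.link = None            # any interrupt clears the link

    def store_conditional(self, addr, value):
        if self.link != addr:
            return 0                # fails: link was cleared or never set
        self.memory.write(addr, value, self)
        self.link = None
        return 1


def fetch_and_increment(proc, addr, interfere=None):
    """LL/SC retry loop; `interfere` optionally runs once between LL and SC."""
    while True:
        old = proc.load_linked(addr)
        if interfere is not None:
            interfere()
            interfere = None
        if proc.store_conditional(addr, old + 1):
            return old


mem = Memory()
p0, p1 = Processor(mem), Processor(mem)
mem.cells[100] = 5
# p1 stores 9 between p0's LL and SC, so p0's first SC fails and it retries.
old = fetch_and_increment(p0, 100, interfere=lambda: mem.write(100, 9, p1))

p0.load_linked(100)
p0.interrupt()
sc_after_interrupt = p0.store_conditional(100, 0)  # fails, memory unchanged
```

In the first call the retry observes the value 9 written by `p1` and stores 10, so the increment is applied to the latest value rather than to a stale one.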


Implementing Locks Using Coherence 


Once we have an atomic operation, we can use the coherence mechanisms of a 
multiprocessor to implement spin locks—locks that a processor continuously tries 
to acquire, spinning around a loop until it succeeds. Spin locks are used when 
programmers expect the lock to be held for a very short amount of time and when 
they want the process of locking to be low latency when the lock is available. 
Because spin locks tie up the processor, waiting in a loop for the lock to become 
free, they are inappropriate in some circumstances. 

The simplest implementation, which we would use if there were no cache 
coherence, would keep the lock variables in memory. A processor could continu- 
ally try to acquire the lock using an atomic operation, say, exchange, and test 
whether the exchange returned the lock as free. To release the lock, the processor 
simply stores the value 0 to the lock. Here is the code sequence to lock a spin 
lock whose address is in R1 using an atomic exchange:


        DADDUI  R2,R0,#1
lockit: EXCH    R2,0(R1)     ;atomic exchange
        BNEZ    R2,lockit    ;already locked?

If our multiprocessor supports cache coherence, we can cache the locks using
the coherence mechanism to maintain the lock value coherently. Caching locks 
has two advantages. First, it allows an implementation where the process of 
"spinning" (trying to test and acquire the lock in a tight loop) could be done on a 
local cached copy rather than requiring a global memory access on each attempt 
to acquire the lock. The second advantage comes from the observation that there 
is often locality in lock accesses: that is, the processor that used the lock last will 
use it again in the near future. In such cases, the lock value may reside in the 
cache of that processor, greatly reducing the time to acquire the lock. 

Obtaining the first advantage—being able to spin on a local cached copy 
rather than generating a memory request for each attempt to acquire the lock— 
requires a change in our simple spin procedure. Each attempt to exchange in the 
loop directly above requires a write operation. If multiple processors are attempt- 
ing to get the lock, each will generate the write. Most of these writes will lead to 
write misses, since each processor is trying to obtain the lock variable in an 
exclusive state. 

Thus, we should modify our spin lock procedure so that it spins by doing 
reads on a local copy of the lock until it successfully sees that the lock is avail- 
able. Then it attempts to acquire the lock by doing a swap operation. A processor 
first reads the lock variable to test its state. A processor keeps reading and testing 
until the value of the read indicates that the lock is unlocked. The processor then 
races against all other processes that were similarly "spin waiting" to see who can 
lock the variable first. All processes use a swap instruction that reads the old 
value and stores a 1 into the lock variable. The single winner will see the 0, and 
the losers will see a 1 that was placed there by the winner. (The losers will con- 
tinue to set the variable to the locked value, but that doesn't matter.) The winning 
processor executes the code after the lock and, when finished, stores a 0 into the
lock variable to release the lock, which starts the race all over again. Here is the 
code to perform this spin lock (remember that 0 is unlocked and 1 is locked): 


lockit: LD      R2,0(R1)     ;load of lock
        BNEZ    R2,lockit    ;not available-spin
        DADDUI  R2,R0,#1     ;load locked value
        EXCH    R2,0(R1)     ;swap
        BNEZ    R2,lockit    ;branch if lock wasn't 0
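
The same spin-on-read structure can be written as a software sketch. Python threads stand in for processors, and an internal `threading.Lock` models only the atomicity and write serialization that the hardware would supply; the class `SpinLock`, the `shared` dictionary, and the iteration counts are assumptions for illustration, and the sketch says nothing about real cache traffic.

```python
import threading
import time


class SpinLock:
    """Spin lock that tests with reads before swapping; 0 unlocked, 1 locked."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()  # models atomic EXCH and write ordering

    def _exchange(self, new_value):
        with self._atomic:
            old, self._value = self._value, new_value
            return old

    def acquire(self):
        while True:
            while self._value != 0:      # spin on plain reads (LD / BNEZ)
                time.sleep(0)            # yield so the lock holder can run
            if self._exchange(1) == 0:   # EXCH: the single winner sees 0
                return
            # Lost the race: return to spinning on reads.

    def release(self):
        with self._atomic:               # serialize the store against swaps
            self._value = 0              # an ordinary store of 0


lock = SpinLock()
shared = {"total": 0}


def worker(iterations):
    for _ in range(iterations):
        lock.acquire()
        shared["total"] += 1             # critical section
        lock.release()


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["total"])  # 800
```

The inner loop performs only reads, so in a coherent machine it would hit in the local cache; the exchange is attempted only after a read has seen the lock free.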


Let's examine how this "spin lock" scheme uses the cache coherence mecha- 
nisms. Figure 4.23 shows the processor and bus or directory operations for multi- 
ple processes trying to lock a variable using an atomic swap. Once the processor 
with the lock stores a 0 into the lock, all other caches are invalidated and must 
fetch the new value to update their copy of the lock. One such cache gets the copy 
of the unlocked value (0) first and performs the swap. When the cache miss of 
other processors is satisfied, they find that the variable is already locked, so they 
must return to testing and spinning.

| Step | Processor P0 | Processor P1 | Processor P2 | Coherence state of lock | Bus/directory activity |
|------|--------------|--------------|--------------|-------------------------|------------------------|
| 1 | Has lock | Spins, testing if lock = 0 | Spins, testing if lock = 0 | Shared | None |
| 2 | Set lock to 0 | (Invalidate received) | (Invalidate received) | Exclusive (P0) | Write invalidate of lock variable from P0 |
| 3 | | Cache miss | Cache miss | Shared | Bus/directory services P2 cache miss; write back from P0 |
| 4 | | (Waits while bus/directory busy) | Lock = 0 | Shared | Cache miss for P2 satisfied |
| 5 | | Lock = 0 | Executes swap, gets cache miss | Shared | Cache miss for P1 satisfied |
| 6 | | Executes swap, gets cache miss | Completes swap: returns 0 and sets Lock = 1 | Exclusive (P2) | Bus/directory services P2 cache miss; generates invalidate |
| 7 | | Swap completes and returns 1, and sets Lock = 1 | Enter critical section | Exclusive (P1) | Bus/directory services P1 cache miss; generates write back |
| 8 | | Spins, testing if lock = 0 | | | None |





Figure 4.23 Cache coherence steps and bus traffic for three processors, P0, P1, and P2. This figure assumes write
invalidate coherence. P0 starts with the lock (step 1). P0 exits and unlocks the lock (step 2). P1 and P2 race to see
which reads the unlocked value during the swap (steps 3-5). P2 wins and enters the critical section (steps 6 and 7),
while P1's attempt fails so it starts spin waiting (steps 7 and 8). In a real system, these events will take many more
than 8 clock ticks, since acquiring the bus and replying to misses takes much longer.


This example shows another advantage of the load linked/store conditional 
primitives: The read and write operations are explicitly separated. The load 
linked need not cause any bus traffic. This fact allows the following simple code 
sequence, which has the same characteristics as the optimized version using 
exchange (R1 has the address of the lock, the LL has replaced the LD, and the SC
has replaced the EXCH): 


lockit: LL      R2,0(R1)     ;load linked
        BNEZ    R2,lockit    ;not available-spin
        DADDUI  R2,R0,#1     ;locked value
        SC      R2,0(R1)     ;store
        BEQZ    R2,lockit    ;branch if store fails


The first branch forms the spinning loop; the second branch resolves races when 
two processors see the lock available simultaneously. 

Although our spin lock scheme is simple and compelling, it has difficulty
scaling up to handle many processors because of the communication traffic
generated when the lock is released. We address this issue and other issues for
larger processor counts in Appendix H.


Models of Memory Consistency: An Introduction


Cache coherence ensures that multiple processors see a consistent view of mem- 
ory. It does not answer the question of how consistent the view of memory must 
be. By "how consistent" we mean, when must a processor see a value that has 
been updated by another processor? Since processors communicate through 
shared variables (used both for data values and for synchronization), the question 
boils down to this: In what order must a processor observe the data writes of 
another processor? Since the only way to "observe the writes of another proces- 
sor" is through reads, the question becomes, What properties must be enforced 
among reads and writes to different locations by different processors? 

Although the question of how consistent memory must be seems simple, it is 
remarkably complicated, as we can see with a simple example. Here are two code 
segments from processes P1 and P2, shown side by side: 


P1:   A = 0;               P2:   B = 0;
      A = 1;                     B = 1;
L1:   if (B == 0) ...      L2:   if (A == 0) ...


Assume that the processes are running on different processors, and that locations 
A and B are originally cached by both processors with the initial value of 0. If 
writes always take immediate effect and are immediately seen by other proces- 
sors, it will be impossible for both if statements (labeled L1 and L2) to evaluate 
their conditions as true, since reaching the if statement means that either A or B 
must have been assigned the value 1. But suppose the write invalidate is delayed, 
and the processor is allowed to continue during this delay; then it is possible that 
both P1 and P2 have not seen the invalidations for B and A (respectively) before 
they attempt to read the values. The question is, Should this behavior be allowed, 
and if so, under what conditions? 

The most straightforward model for memory consistency is called sequential 
consistency. Sequential consistency requires that the result of any execution be 
the same as if the memory accesses executed by each processor were kept in 
order and the accesses among different processors were arbitrarily interleaved. 
Sequential consistency eliminates the possibility of some nonobvious execution 
in the previous example because the assignments must be completed before the if 
statements are initiated. 
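
The claim can be checked by brute force for the two-process example. The sketch below enumerates every interleaving that keeps each processor's own order and records whether each if condition is true. The encoding of operations as tuples and the function names are assumptions for illustration, and the "buffered" case is a deliberately crude stand-in for a delayed write: it simply lets each read run before the write that precedes it in program order.

```python
# Initial values A = B = 0 are taken as the starting memory state.
P1 = [("write", "A", 1), ("read", "B")]   # A = 1;  L1: if (B == 0)
P2 = [("write", "B", 1), ("read", "A")]   # B = 1;  L2: if (A == 0)


def interleavings(p1, p2):
    """All merges of p1 and p2 that keep each processor's own order."""
    if not p1:
        yield list(p2)
    elif not p2:
        yield list(p1)
    else:
        for rest in interleavings(p1[1:], p2):
            yield [p1[0]] + rest
        for rest in interleavings(p1, p2[1:]):
            yield [p2[0]] + rest


def outcomes(p1, p2):
    """Set of (L1 condition, L2 condition) over all interleavings."""
    results = set()
    tagged1 = [(1,) + op for op in p1]
    tagged2 = [(2,) + op for op in p2]
    for order in interleavings(tagged1, tagged2):
        mem = {"A": 0, "B": 0}
        cond = {}
        for proc, kind, var, *value in order:
            if kind == "write":
                mem[var] = value[0]
            else:
                cond[proc] = mem[var] == 0
        results.add((cond[1], cond[2]))
    return results


sequential = outcomes(P1, P2)
# Assumption: model a delayed write by letting each read pass its earlier write.
buffered = outcomes(P1[::-1], P2[::-1])

print((True, True) in sequential)  # False
print((True, True) in buffered)    # True
```

Under sequential consistency at most one condition can be true, while the reordered case admits the outcome in which both are.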

The simplest way to implement sequential consistency is to require a proces- 
sor to delay the completion of any memory access until all the invalidations 
caused by that access are completed. Of course, it is equally effective to delay the 
next memory access until the previous one is completed. Remember that memory 
consistency involves operations among different variables: the two accesses that 
must be ordered are actually to different memory locations. In our example, we 
must delay the read of A or B (A == 0 or B == 0) until the previous write has com- 
pleted (B = 1 or A = 1). Under sequential consistency, we cannot, for example, 
simply place the write in a write buffer and continue with the read.

Although sequential consistency presents a simple programming paradigm, it
reduces potential performance, especially in a multiprocessor with a large
number of processors or long interconnect delays, as we can see in the following
example.

Example

Suppose we have a processor where a write miss takes 50 cycles to establish
ownership, 10 cycles to issue each invalidate after ownership is established, and
80 cycles for an invalidate to complete and be acknowledged once it is issued.
Assuming that four other processors share a cache block, how long does a write
miss stall the writing processor if the processor is sequentially consistent?
Assume that the invalidates must be explicitly acknowledged before the coherence
controller knows they are completed. Suppose we could continue executing
after obtaining ownership for the write miss without waiting for the invalidates;
how long would the write take?

Answer

When we wait for invalidates, each write takes the sum of the ownership time
plus the time to complete the invalidates. Since the invalidates can overlap, we
need only worry about the last one, which starts 10 + 10 + 10 + 10 = 40 cycles
after ownership is established. Hence the total time for the write is 50 + 40 + 80 =
170 cycles. In comparison, the ownership time is only 50 cycles. With appropriate
write buffer implementations, it is even possible to continue before ownership
is established.
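
The arithmetic in this answer generalizes to any number of sharers. The following sketch restates it as a function; the function and parameter names are assumptions, and the model is the one used in the answer: invalidates are issued one after another and overlap once issued.

```python
def write_miss_stall(ownership, issue_per_invalidate, invalidate_latency,
                     sharers, wait_for_invalidates=True):
    """Cycles the writing processor stalls on a write miss."""
    if not wait_for_invalidates or sharers == 0:
        return ownership
    # Invalidates overlap, so only the last one matters: it is issued
    # `sharers * issue_per_invalidate` cycles after ownership is established.
    last_issue = sharers * issue_per_invalidate
    return ownership + last_issue + invalidate_latency


sc_cycles = write_miss_stall(50, 10, 80, sharers=4)
relaxed_cycles = write_miss_stall(50, 10, 80, sharers=4,
                                  wait_for_invalidates=False)
print(sc_cycles, relaxed_cycles)  # 170 50
```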


To provide better performance, researchers and architects have explored two 
different routes. First, they developed ambitious implementations that preserve 
sequential consistency but use latency-hiding techniques to reduce the penalty; 
we discuss these in Section 4.7. Second, they developed less restrictive memory 
consistency models that allow for faster hardware. Such models can affect how 
the programmer sees the multiprocessor, so before we discuss these less restric- 
tive models, let's look at what the programmer expects. 


The Programmer's View 


Although the sequential consistency model has a performance disadvantage, 
from the viewpoint of the programmer it has the advantage of simplicity. The 
challenge is to develop a programming model that is simple to explain and yet 
allows a high-performance implementation. 

One such programming model that allows us to have a more efficient
implementation is to assume that programs are synchronized. A program is
synchronized if all accesses to shared data are ordered by synchronization
operations. A data reference is ordered by a synchronization operation if, in
every possible execution, a write of a variable by one processor and an access
(either a read or a write) of that variable by another processor are separated
by a pair of synchronization operations, one executed after the write by the
writing processor and one executed before the access by the second processor.
Cases where variables may be updated without ordering by synchronization are
called data races because the execution outcome depends on the relative speed of
the processors, and like races in hardware design, the outcome is unpredictable,
which leads to another name for synchronized programs: data-race-free.

As a simple example, consider a variable being read and updated by two dif- 
ferent processors. Each processor surrounds the read and update with a lock and 
an unlock, both to ensure mutual exclusion for the update and to ensure that the 
read is consistent. Clearly, every write is now separated from a read by the other 
processor by a pair of synchronization operations: one unlock (after the write) 
and one lock (before the read). Of course, if two processors are writing a variable 
with no intervening reads, then the writes must also be separated by synchroniza- 
tion operations. 
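
A minimal sketch of this pattern, using a library lock rather than a hand-built one: Python threads stand in for the two processors, and the `shared` dictionary, the function `update`, and the iteration count are assumptions for illustration.

```python
import threading

lock = threading.Lock()
shared = {"x": 0}


def update(times):
    for _ in range(times):
        with lock:                   # lock: precedes this processor's read
            value = shared["x"]      # read of the shared variable
            shared["x"] = value + 1  # update (write)
        # leaving the `with` block is the unlock that follows the write


threads = [threading.Thread(target=update, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["x"])  # 2000
```

Every write by one thread is followed by an unlock, and every access by the other is preceded by a lock, so no update is lost.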

It is a broadly accepted observation that most programs are synchronized. 
This observation is true primarily because if the accesses were unsynchronized, 
the behavior of the program would likely be unpredictable because the speed of 
execution would determine which processor won a data race and thus affect the 
results of the program. Even with sequential consistency, reasoning about such 
programs is very difficult. 

Programmers could attempt to guarantee ordering by constructing their own 
synchronization mechanisms, but this is extremely tricky, can lead to buggy pro- 
grams, and may not be supported architecturally, meaning that they may not 
work in future generations of the multiprocessor. Instead, almost all program- 
mers will choose to use synchronization libraries that are correct and optimized 
for the multiprocessor and the type of synchronization. 

Finally, the use of standard synchronization primitives ensures that even if the 
architecture implements a more relaxed consistency model than sequential con- 
sistency, a synchronized program will behave as if the hardware implemented 
sequential consistency. 


Relaxed Consistency Models: The Basics 


The key idea in relaxed consistency models is to allow reads and writes to com- 
plete out of order, but to use synchronization operations to enforce ordering, so 
that a synchronized program behaves as if the processor were sequentially con- 
sistent. There are a variety of relaxed models that are classified according to what 
read and write orderings they relax. We specify the orderings by a set of rules of 
the form X→Y, meaning that operation X must complete before operation Y is
done. Sequential consistency requires maintaining all four possible orderings:
R→W, R→R, W→R, and W→W. The relaxed models are defined by which of
these four sets of orderings they relax:


1. Relaxing the W→R ordering yields a model known as total store ordering or
   processor consistency. Because this ordering retains ordering among writes,
   many programs that operate under sequential consistency operate under this
   model, without additional synchronization.

2. Relaxing the W→W ordering yields a model known as partial store order.

3. Relaxing the R→W and R→R orderings yields a variety of models including
   weak ordering, the PowerPC consistency model, and release consistency,
   depending on the details of the ordering restrictions and how synchronization
   operations enforce ordering.
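
The classification above can be written down as a small lookup table. This sketch assumes that each model in the list also relaxes the orderings relaxed by the models before it, which the list does not state explicitly, and it ignores how synchronization operations restore ordering; the dictionary keys and the function name are illustrative.

```python
ALL_ORDERINGS = {("R", "W"), ("R", "R"), ("W", "R"), ("W", "W")}

# Orderings each model relaxes. The cumulative reading is an assumption.
RELAXED = {
    "sequential consistency": set(),
    "total store ordering": {("W", "R")},
    "partial store order": {("W", "R"), ("W", "W")},
    "weak ordering / release consistency": set(ALL_ORDERINGS),
}


def must_stay_ordered(model, first, second):
    """True if `first` must complete before `second` under `model`."""
    return (first, second) not in RELAXED[model]


print(must_stay_ordered("total store ordering", "W", "R"))  # False
print(must_stay_ordered("total store ordering", "W", "W"))  # True
```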


By relaxing these orderings, the processor can possibly obtain significant perfor- 
mance advantages. There are, however, many complexities in describing relaxed 
consistency models, including the advantages and complexities of relaxing dif- 
ferent orders, defining precisely what it means for a write to complete, and decid- 
ing when processors can see values that the processor itself has written. For more 
information about the complexities, implementation issues, and performance 
potential from relaxed models, we highly recommend the excellent tutorial by 
Adve and Gharachorloo [1996]. 


Final Remarks on Consistency Models 


At the present time, many multiprocessors being built support some sort of 
relaxed consistency model, varying from processor consistency to release consis- 
tency. Since synchronization is highly multiprocessor specific and error prone, 
the expectation is that most programmers will use standard synchronization 
libraries and will write synchronized programs, making the choice of a weak con- 
sistency model invisible to the programmer and yielding higher performance. 

An alternative viewpoint, which we discuss more extensively in the next sec- 
tion, argues that with speculation much of the performance advantage of relaxed 
consistency models can be obtained with sequential or processor consistency. 

A key part of this argument in favor of relaxed consistency revolves around 
the role of the compiler and its ability to optimize memory access to potentially 
shared variables; this topic is also discussed in the next section. 


Crosscutting Issues 


Because multiprocessors redefine many system characteristics (e.g., performance 
assessment, memory latency, and the importance of scalability), they introduce 
interesting design problems that cut across the spectrum, affecting both hardware 
and software. In this section we give several examples related to the issue of 
memory consistency. 


Compiler Optimization and the Consistency Model 


Another reason for defining a model for memory consistency is to specify the 
range of legal compiler optimizations that can be performed on shared data. In 
explicitly parallel programs, unless the synchronization points are clearly 
defined and the programs are synchronized, the compiler could not interchange 
a read and a write of two different shared data items because such transformations
might affect the semantics of the program. This prevents even relatively
simple optimizations, such as register allocation of shared data, because such a 
process usually interchanges reads and writes. In implicitly parallelized pro- 
grams—for example, those written in High Performance FORTRAN (HPF)— 
programs must be synchronized and the synchronization points are known, so 
this issue does not arise. 


Using Speculation to Hide Latency in Strict Consistency Models 


As we saw in Chapter 2, speculation can be used to hide memory latency. It can 
also be used to hide latency arising from a strict consistency model, giving much 
of the benefit of a relaxed memory model. The key idea is for the processor to use
dynamic scheduling to reorder memory references, letting them possibly execute 
out of order. Executing the memory references out of order may generate viola- 
tions of sequential consistency, which might affect the execution of the program. 
This possibility is avoided by using the delayed commit feature of a speculative 
processor. Assume the coherency protocol is based on invalidation. If the proces- 
sor receives an invalidation for a memory reference before the memory reference 
is committed, the processor uses speculation recovery to back out the computa- 
tion and restart with the memory reference whose address was invalidated. 

If the reordering of memory requests by the processor yields an execution 
order that could result in an outcome that differs from what would have been seen 
under sequential consistency, the processor will redo the execution. The key to 
using this approach is that the processor need only guarantee that the result 
would be the same as if all accesses were completed in order, and it can achieve 
this by detecting when the results might differ. The approach is attractive because 
the speculative restart will rarely be triggered. It will only be triggered when 
there are unsynchronized accesses that actually cause a race [Gharachorloo, 
Gupta, and Hennessy 1992]. 
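
The recovery mechanism described above can be sketched as a small model. The Python below is only an illustration under stated assumptions, not a description of any real processor: the names `SpeculativeLoad` and `LoadQueue` are hypothetical, and the policy of squashing the conflicting load together with all younger loads is one simple choice among several.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeLoad:
    seq: int      # program-order sequence number
    address: int  # address the load read from
    value: int    # value obtained speculatively


@dataclass
class LoadQueue:
    """Loads that have executed but not yet committed (hypothetical model)."""
    pending: list = field(default_factory=list)

    def execute(self, seq, address, memory):
        # The load may run out of program order; it stays speculative until commit.
        self.pending.append(SpeculativeLoad(seq, address, memory[address]))

    def invalidate(self, address):
        """Handle an incoming invalidation.

        Returns the sequence number to restart from, or None when no
        uncommitted load touched the invalidated address.
        """
        hits = [ld.seq for ld in self.pending if ld.address == address]
        if not hits:
            return None
        restart = min(hits)
        # Back out the conflicting load and everything younger than it.
        self.pending = [ld for ld in self.pending if ld.seq < restart]
        return restart

    def commit_oldest(self):
        # Commit strictly in program order.
        self.pending.sort(key=lambda ld: ld.seq)
        return self.pending.pop(0)
```

In this model an invalidation that matches no uncommitted load costs nothing, which mirrors the observation that the restart is triggered only when an unsynchronized access actually races with a speculative reference.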

Hill [1998] advocates the combination of sequential or processor consistency 
together with speculative execution as the consistency model of choice. His argu- 
ment has three parts. First, an aggressive implementation of either sequential 
consistency or processor consistency will gain most of the advantage of a more 
relaxed model. Second, such an implementation adds very little to the implemen- 
tation cost of a speculative processor. Third, such an approach allows the pro- 
grammer to reason using the simpler programming models of either sequential or 
processor consistency. 

The MIPS R10000 design team had this insight in the mid-1990s and used 
the R10000's out-of-order capability to support this type of aggressive implementation 
of sequential consistency. Hill's arguments are likely to motivate others 
to follow this approach. 

One open question is how successful compiler technology will be in optimizing 
memory references to shared variables. The state of optimization technology 
and the fact that shared data are often accessed via pointers or array indexing 
have limited the use of such optimizations. If this technology became available 
and led to significant performance advantages, compiler writers would want to be 
able to take advantage of a more relaxed programming model. 


Inclusion and Its Implementation 


All multiprocessors use multilevel cache hierarchies to reduce both the demand 
on the global interconnect and the latency of cache misses. If the cache also pro- 
vides multilevel inclusion—every level of cache hierarchy is a subset of the level 
further away from the processor—then we can use the multilevel structure to re- 
duce the contention between coherence traffic and processor traffic that occurs 
when snoops and processor cache accesses must contend for the cache. Many 
multiprocessors with multilevel caches enforce the inclusion property, although 
recent multiprocessors with smaller L1 caches and different block sizes have 
sometimes chosen not to enforce inclusion. This restriction is also called the sub- 
set property because each cache is a subset of the cache below it in the hierarchy. 

At first glance, preserving the multilevel inclusion property seems trivial. 
Consider a two-level example: any miss in L1 either hits in L2 or generates a 
miss in L2, causing it to be brought into both L1 and L2. Likewise, any invalidate 
that hits in L2 must be sent to L1, where it will cause the block to be invalidated 
if it exists. 

The catch is what happens when the block sizes of L1 and L2 are different. 
Choosing different block sizes is quite reasonable, since L2 will be much larger 
and have a much longer latency component in its miss penalty, and thus will want 
to use a larger block size. What happens to our "automatic" enforcement of inclusion 
when the block sizes differ? A block in L2 represents multiple blocks in L1, 
and a miss in L2 causes the replacement of data that is equivalent to multiple L1 
blocks. For example, if the block size of L2 is four times that of L1, then a miss 
in L2 will replace the equivalent of four L1 blocks. Let's consider a detailed 
example. 


Example Assume that L2 has a block size four times that of L1. Show how a miss for an 
address that causes a replacement in L1 and L2 can lead to violation of the inclusion 
property. 


Answer Assume that L1 and L2 are direct mapped and that the block size of L1 is b bytes 
and the block size of L2 is 4b bytes. Suppose L1 contains two blocks with starting 
addresses x and x + b and that x mod 4b = 0, meaning that x also is the starting 
address of a block in L2; then that single block in L2 contains the L1 blocks x, 
x + b, x + 2b, and x + 3b. Suppose the processor generates a reference to block y 
that maps to the block containing x in both caches and hence misses. Since L2 
missed, it fetches 4b bytes and replaces the block containing x, x + b, x + 2b, and 
x + 3b, while L1 takes b bytes and replaces the block containing x. Since L1 still 
contains x + b, but L2 does not, the inclusion property no longer holds. 
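
The scenario in this answer can be reproduced with a toy model. The sketch below is illustrative only: the cache geometry (eight blocks per cache, b = 16 bytes), the addresses, and all class and function names are assumptions chosen for the example. With `enforce_inclusion=True` the model also probes L1 whenever L2 evicts a block, which is one way to restore the property.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (hypothetical helper)."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.lines = [None] * num_blocks  # start address of the resident block

    def block_start(self, addr):
        return addr - (addr % self.block_size)

    def index(self, addr):
        return (addr // self.block_size) % self.num_blocks

    def contains(self, addr):
        return self.lines[self.index(addr)] == self.block_start(addr)

    def fill(self, addr):
        """Install the block holding addr; return the evicted block start or None."""
        i = self.index(addr)
        evicted = self.lines[i]
        self.lines[i] = self.block_start(addr)
        return evicted


def access(l1, l2, addr, enforce_inclusion):
    if not l2.contains(addr):
        victim = l2.fill(addr)
        if enforce_inclusion and victim is not None:
            # Invalidate every L1 block covered by the evicted L2 block.
            for a in range(victim, victim + l2.block_size, l1.block_size):
                if l1.contains(a):
                    l1.lines[l1.index(a)] = None
    if not l1.contains(addr):
        l1.fill(addr)


def inclusion_holds(l1, l2):
    return all(start is None or l2.contains(start) for start in l1.lines)


def demo(enforce_inclusion):
    b = 16
    l1 = DirectMappedCache(num_blocks=8, block_size=b)
    l2 = DirectMappedCache(num_blocks=8, block_size=4 * b)
    x = 0                      # x mod 4b == 0
    access(l1, l2, x, enforce_inclusion)
    access(l1, l2, x + b, enforce_inclusion)
    y = 512                    # maps to the same line as x in both caches
    access(l1, l2, y, enforce_inclusion)
    return inclusion_holds(l1, l2)
```

Calling `demo(False)` leaves x + b in L1 after L2 has dropped the enclosing block, so inclusion fails; `demo(True)` invalidates the stale L1 block and inclusion is preserved.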


To maintain inclusion with multiple block sizes, we must probe the higher 
levels of the hierarchy when a replacement is done at the lower level to ensure 
that any words replaced in the lower level are invalidated in the higher-level 
caches; different levels of associativity create the same sort of problems. In 2006, 
designers appear to be split on the enforcement of inclusion. Baer and Wang 
[1988] describe the advantages and challenges of inclusion in detail. 


4.8 Putting It All Together: The Sun T1 Multiprocessor 


T1 is a multicore multiprocessor introduced by Sun in 2005 as a server processor. 
What makes T1 especially interesting is that it is almost totally focused on 
exploiting thread-level parallelism (TLP) rather than instruction-level parallelism 
(ILP). Indeed, it is the only single-issue desktop or server microprocessor intro- 
duced in more than five years. Instead of focusing on ILP, T1 puts all its attention 
on TLP, using both multiple cores and multithreading to produce throughput. 

Each T1 processor contains eight processor cores, each supporting four 
threads. Each processor core consists of a simple six-stage, single-issue pipeline 
(a standard five-stage RISC pipeline like that of Appendix A, with one stage 
added for thread switching). T1 uses fine-grained multithreading, switching to a 
new thread on each clock cycle, and threads that are idle because they are waiting 
due to a pipeline delay or cache miss are bypassed in the scheduling. The proces- 
sor is idle only when all four threads are idle or stalled. Both loads and branches 
incur a 3-cycle delay that can only be hidden by other threads. A single set of 
floating-point functional units is shared by all eight cores, as floating-point per- 
formance was not a focus for T1. 
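
The thread-selection behavior described above can be sketched as follows. This is a minimal model under an explicit assumption: threads are picked round-robin among those that are ready, which may differ from the selection policy the real hardware uses. The function name and its input format are hypothetical.

```python
def schedule(ready_by_cycle, num_threads=4):
    """Pick one thread to issue per cycle on a single core.

    ready_by_cycle[c] is the set of thread ids able to issue in cycle c.
    Returns the issuing thread for each cycle, or None when the core idles.
    Assumption: round-robin order starting after the last thread that issued.
    """
    issued = []
    last = num_threads - 1
    for ready in ready_by_cycle:
        choice = None
        for offset in range(1, num_threads + 1):
            candidate = (last + offset) % num_threads
            if candidate in ready:
                choice = candidate
                break
        issued.append(choice)
        if choice is not None:
            last = choice
    return issued


# Threads 1 and 2 stall in cycle 2, every thread stalls in cycle 4.
trace = schedule([{0, 1, 2, 3}, {0, 1, 2, 3}, {0, 3}, {0, 1, 2, 3}, set(), {2}])
print(trace)  # [0, 1, 3, 0, None, 2]
```

Stalled threads are simply skipped, and the core loses a cycle only when the ready set is empty, matching the statement that the processor is idle only when all four threads are idle or stalled.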

Figure 4.24 shows the organization of the T1 processor. The cores access four 
level 2 caches via a crossbar switch, which also provides access to the shared 
floating-point unit. Coherency is enforced among the L1 caches by a directory 
associated with each L2 cache block. The directory operates analogously to those 
we discussed in Section 4.4, but is used to track which L1 caches have copies of 
an L2 block. By associating each L2 cache with a particular memory bank and 
enforcing the subset property, T1 can place the directory at L2 rather than at the 
memory, which reduces the directory overhead. Because the L1 data cache is 
write through, only invalidation messages are required; the data can always be 
retrieved from the L2 cache. 

Figure 4.25 summarizes the T1 processor. 


T1 Performance 


We look at the performance of T1 using three server-oriented benchmarks: TPC-C, 
SPECJBB (the SPEC Java Business Benchmark), and SPECWeb99. The 
SPECWeb99 benchmark is run on a four-core version of T1 because it cannot 
scale to use the full 32 threads of an eight-core processor; the other two benchmarks 
are run with eight cores and 4 threads each for a total of 32 threads. 

Figure 4.24 The T1 processor. Each core supports four threads and has its own level 1 
caches (16 KB for instructions and 8 KB for data). The level 2 caches total 3 MB and are 
effectively 12-way associative. The caches are interleaved by 64-byte cache lines. 


We begin by looking at the effect of multithreading on the performance of the 
memory system when running in single-threaded versus multithreaded mode. 
Figure 4.26 shows the relative increase in the miss rate and the observed miss 
latency when executing with 1 thread per core versus executing 4 threads per core 
for TPC-C. Both the miss rates and the miss latencies increase, due to increased 
contention in the memory system. The relatively small increase in miss latency 
indicates that the memory system still has unused capacity. 

As we demonstrated in the previous section, the performance of multiprocessor 
workloads depends intimately on the memory system and the interaction with 
the application. For T1 both the L2 cache size and the block size are key parameters. 
Figure 4.27 shows the effect on miss rates from varying the L2 cache size by 
a factor of 2 from the base of 3 MB and by reducing the block size to 32 bytes. 
The data clearly show a significant advantage of a 3 MB L2 versus a 1.5 MB; further 
improvements can be gained from a 6 MB L2. As we can see, the choice of a 
64-byte block size reduces the miss rate but by considerably less than a factor of 
2. Hence, using the larger block size T1 generates more traffic to the memories. 
Whether this has a significant performance impact depends on the characteristics 
of the memory system. 

Characteristic             Sun T1 
Multiprocessor and         Eight cores per chip; four threads per core. Fine-grained thread 
multithreading support     scheduling. One shared floating-point unit for eight cores. 
                           Supports only on-chip multiprocessing. 
Pipeline structure         Simple, in-order, six-deep pipeline with 3-cycle delays for loads 
                           and branches. 
L1 caches                  16 KB instructions; 8 KB data. 64-byte block size. Miss to L2 is 
                           23 cycles, assuming no contention. 
L2 caches                  Four separate L2 caches, each 750 KB and associated with a 
                           memory bank. 64-byte block size. Miss to main memory is 110 
                           clock cycles assuming no contention. 
Initial implementation     90 nm process; maximum clock rate of 1.2 GHz; power 79 W; 
                           300M transistors, 379 mm² die. 

Figure 4.25 A summary of the T1 processor. 

Figure 4.26 The relative change in the miss rates and miss latencies when executing 
with 1 thread per core versus 4 threads per core on the TPC-C benchmark. The latencies 
are the actual time to return the requested data after a miss. In the 4-thread case, 
the execution of other threads could potentially hide much of this latency. 

Figure 4.27 Change in the L2 miss rate with variation in cache size and block size. 
Both TPC-C and SPECJBB are run with all eight cores and four threads per core. Recall 
that T1 has a 3 MB L2 with 64-byte lines. 

Figure 4.28 The change in the miss latency of the L2 cache as the cache size and 
block size are changed. Although TPC-C has a significantly higher miss rate, its miss 
penalty is only slightly higher. This is because SPECJBB has a much higher dirty miss 
rate, requiring L2 cache lines to be written back with high frequency. Recall that T1 has 
a 3 MB L2 with 64-byte lines. 
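
The block-size trade-off behind Figure 4.27 comes down to simple arithmetic: memory traffic is the number of misses times the bytes moved per miss. The sketch below uses made-up miss counts (10 and 7 misses per 1000 instructions), not measured T1 data, purely to show why a miss-rate reduction of less than a factor of 2 still raises traffic when the block size doubles. The function name is hypothetical.

```python
def traffic_per_kilo_instr(misses_per_kilo_instr, block_bytes):
    """Bytes fetched from memory per 1000 instructions (fills only)."""
    return misses_per_kilo_instr * block_bytes


# Hypothetical miss counts, chosen only for illustration.
small_blocks = traffic_per_kilo_instr(misses_per_kilo_instr=10, block_bytes=32)
large_blocks = traffic_per_kilo_instr(misses_per_kilo_instr=7, block_bytes=64)

print(small_blocks, large_blocks)  # 320 448
```

With these assumed numbers the larger block cuts misses by 30% yet moves 40% more data, which is the effect the text attributes to the 64-byte choice.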


As we mentioned earlier, there is some contention at the memory from multi- 
ple threads. How do the cache size and block size affect the contention at the 
memory system? Figure 4.28 shows the effect on the L2 cache miss latency under 
the same variations as we saw in Figure 4.27. As we can see, for either a 3 MB or 
6 MB cache, the larger block size results in a smaller L2 cache miss time. How 
can this be if the miss rate changes much less than a factor of 2? As we will see in 
more detail in the next chapter, modern DRAMs provide a block of data for only 
slightly more time than needed to provide a single word; thus, the miss penalty 
for the 32-byte block is only slightly less than that for the 64-byte block. 


Overall Performance 


Figure 4.29 shows the per-thread and per-core CPI, as well as the effective 
instructions per clock (IPC) for the eight-processor chip. Because T1 is a fine-grained 
multithreaded processor with four threads per core, with sufficient parallelism 
the ideal effective CPI per thread would be 4, since that would mean that 
each thread was consuming one cycle out of every four. The ideal CPI per core 
would be 1. The effective IPC for T1 is simply 8 divided by the per-core CPI. 
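
The relationships among these quantities can be checked with a few lines of arithmetic. The sketch below starts from the per-thread CPI values reported in Figure 4.29 and derives the other columns; the constant and function names are illustrative assumptions.

```python
THREADS_PER_CORE = 4
CORES = 8

# Per-thread CPI values as listed in Figure 4.29.
per_thread_cpi = {"TPC-C": 7.2, "SPECJBB": 5.6, "SPECWeb99": 6.6}


def summarize(cpi_thread):
    cpi_core = cpi_thread / THREADS_PER_CORE  # four threads share one pipeline
    cpi_chip = cpi_core / CORES               # effective CPI for eight cores
    ipc_chip = CORES / cpi_core               # effective IPC for eight cores
    fraction_of_ideal = ipc_chip / CORES      # ideal is one instruction per core per clock
    return cpi_core, cpi_chip, ipc_chip, fraction_of_ideal


for name, cpi in per_thread_cpi.items():
    cpi_core, cpi_chip, ipc_chip, frac = summarize(cpi)
    print(f"{name}: core CPI {cpi_core:.2f}, chip CPI {cpi_chip:.3f}, "
          f"IPC {ipc_chip:.1f}, {frac:.0%} of ideal")
```

The derived fractions of ideal throughput fall between roughly 56% and 71% for the three benchmarks.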

At first glance, one might react that T1 is not very efficient, since the effective 
throughput is between 56% and 71% of the ideal on these three benchmarks. But, 
consider the comparative performance of a wide-issue superscalar. Processors 
such as the Itanium 2 (higher transistor count, much higher power, comparable 
silicon area) would need to achieve incredible instruction throughput sustaining 
4.5-5.7 instructions per clock, well more than double the acknowledged IPC. It 
appears quite clear that, at least for integer-oriented server applications with 
thread-level parallelism, a multicore approach is a much better alternative than a 
single very wide issue processor. The next subsection offers some performance 
comparisons among multicore processors. 

By looking at the behavior of an average thread, we can understand the inter- 
action between multithreading and parallel processing. Figure 4.30 shows the 
percentage of cycles for which a thread is executing, ready but not executing, and 
not ready. Remember that not ready does not imply that the core with that thread 
is stalled; it is only when all four threads are not ready that the core will stall. 

Threads can be not ready due to cache misses, pipeline delays (arising from 
long latency instructions such as branches, loads, floating point, or integer multi- 
ply/divide), and a variety of smaller effects. Figure 4.31 shows the relative fre- 
quency of these various causes. Cache effects are responsible for the thread not 
being ready from 50% to 75% of the time, with L1 instruction misses, L1 data 
misses, and L2 misses contributing roughly equally. Potential delays from the 
pipeline (called "pipeline delay") are most severe in SPECJBB and may arise 
from its higher branch frequency. 














Benchmark    Per-thread CPI   Per-core CPI   Effective CPI for eight cores   Effective IPC for eight cores 
TPC-C        7.2              1.8            0.225                           4.4 
SPECJBB      5.6              1.40           0.175                           5.7 
SPECWeb99    6.6              1.65           0.206                           4.8 

Figure 4.29 The per-thread CPI, the per-core CPI, the effective eight-core CPI, and the effective IPC (inverse of 
CPI) for the eight-core T1 processor. 

Figure 4.30 Breakdown of the status on an average thread. Executing indicates the 
thread issues an instruction in that cycle. Ready but not chosen means it could issue, 
but another thread has been chosen, and not ready indicates that the thread is awaiting 
the completion of an event (a pipeline delay or cache miss, for example). 

Figure 4.31 The breakdown of causes for a thread being not ready. The contribution 
to the "other" category varies. In TPC-C, store buffer full is the largest contributor; in 
SPECJBB, atomic instructions are the largest contributor; and in SPECWeb99, both factors 
contribute. 


Performance of Multicore Processors on SPEC Benchmarks 


Among recent processors, T1 is uniquely characterized by an intense focus on 
thread-level parallelism versus instruction-level parallelism. It uses multithreading 
to achieve performance from a simple RISC pipeline, and it uses multiprocessing 
with eight cores on a die to achieve high throughput for server 
applications. In contrast, the dual-core Power5, Opteron, and Pentium D use both 
multiple issue and multicore. Of course, exploiting significant ILP requires much 
bigger processors, with the result being that fewer cores fit on a chip in comparison 
to T1. Figure 4.32 summarizes the features of these multicore chips. 

In addition to the differences in emphasis on ILP versus TLP, there are several 
other fundamental differences in the designs. Among the most important are 


• There are significant differences in floating-point support and performance. 
  The Power5 puts a major emphasis on floating-point performance, the 
  Opteron and Pentium allocate significant resources, and the T1 almost 
  ignores it. As a result, Sun is unlikely to provide any benchmark results for 
  floating-point applications. A comparison that included only integer programs 
  would be unfair to the three processors that include significant 
  floating-point hardware (and the silicon and power cost associated with it). In 
  contrast, a comparison using only floating-point applications would be unfair 
  to the T1. 

• The multiprocessor expandability of these systems differs and that affects the 
  memory system design and the use of external interfaces. Power5 is designed 
  for the most expandability. The Pentium and Opteron designs offer limited 
  multiprocessor support. The T1 is not expandable to a larger system. 

• The implementation technologies vary, making comparisons based on die 
  size and power more difficult. 

• There are significant differences in the assumptions about memory systems 
  and the memory bandwidth available. For benchmarks with high cache miss 
  rates, such as TPC-C and similar programs, the processors with larger memory 
  bandwidth have a significant advantage. 

Characteristic                          Sun T1         AMD Opteron     Intel Pentium D   IBM Power5 
Cores                                   8              2               2                 2 
Instruction issues per clock per core   1              3               3                 4 
Multithreading                          Fine-grained   No              SMT               SMT 
Caches 
  L1 I/D in KB per core                 16/8           64/64           12K uops/16       64/32 
  L2 per core/shared                    3 MB shared    1 MB/core       1 MB/core         1.9 MB shared 
  L3 (off-chip)                                                                          36 MB 
Peak memory bandwidth (DDR2 DRAMs)      34.4 GB/sec    8.6 GB/sec      4.3 GB/sec        17.2 GB/sec 
Peak MIPS                               9600           7200            9600              7600 
FLOPS                                   1200           4800 (w. SSE)   6400 (w. SSE)     7600 
Clock rate (GHz)                        1.2            2.4             3.2               1.9 
Transistor count (M)                    300            233             230               276 
Die size (mm²)                          379            199             206               389 
Power (W)                               79             110             130               125 

Figure 4.32 Summary of the features and characteristics of four multicore processors. 


Nonetheless, given the importance of the trade-off between ILP-centric and 
TLP-centric designs, it would be useful to try to quantify the performance differ- 
ences as well as the efficacy of the approaches. Figure 4.33 shows the perfor- 
mance of the four multicore processors using the SPECRate CPU benchmarks, 
the SPECJBB2005 Java business benchmark, the SPECWeb05 Web server 
benchmark, and a TPC-C-like benchmark. 

Figure 4.34 shows efficiency measures in terms of performance per unit die 
area and per watt for the four multicore processors, with the results normalized to 
the measurement on the Pentium D. The most obvious distinction is the significant 
advantage in terms of performance/watt for the Sun T1 processor on the 
TPC-C-like and SPECJBB05 benchmarks. These measurements clearly demonstrate 
that for multithreaded applications, a TLP approach may be much more 
power efficient than an ILP-intensive approach. This is the strongest evidence to 
date that the TLP route may provide a way to increase performance in a power-efficient 
fashion. 

Figure 4.33 Four multicore processors showing their performance on a variety of 
SPEC benchmarks and a TPC-C-like benchmark. All the numbers are normalized to the 
Pentium D (which is therefore at 1 for all the benchmarks). Some results are estimates 
from slightly larger configurations (e.g., four cores and two processors, rather than two 
cores and one processor), including the Opteron SPECJBB2005 result, the Power5 
SPECWeb05 result, and the TPC-C results for the Power5, Opteron, and Pentium D. At 
the current time, Sun has refused to release SPECRate results for either the integer or FP 
portion of the suite. 

Figure 4.34 Performance efficiency on SPECRate for four multicore processors, normalized 
to the Pentium D metric (which is always 1). 


It is too early to conclude whether the TLP-intensive approaches will win 
across the board. If typical server applications have enough threads to keep T1 
busy and the per-thread performance is acceptable, the T1 approach will be tough 
to beat. If single-threaded performance remains important in server or desktop 
environments, then we may see the market further fracture with significantly dif- 
ferent processors for throughput-oriented environments and environments where 
higher single-thread performance remains important. 


Fallacies and Pitfalls 


Given the lack of maturity in our understanding of parallel computing, there are 
many hidden pitfalls that will be uncovered either by careful designers or by 
unfortunate ones. Given the large amount of hype that has surrounded multi- 
processors, especially at the high end, common fallacies abound. We have 
included a selection of these. 


Pitfall Measuring performance of multiprocessors by linear speedup versus execution 
time. 


"Mortar shot" graphs—plotting performance versus number of processors, show- 
ing linear speedup, a plateau, and then a falling off—have long been used to 
judge the success of parallel processors. Although speedup is one facet of a paral- 
lel program, it is not a direct measure of performance. The first question is the 
power of the processors being scaled: A program that linearly improves perfor- 
mance to equal 100 Intel 486s may be slower than the sequential version on a 
Pentium 4. Be especially careful of floating-point-intensive programs; processing 
elements without hardware assist may scale wonderfully but have poor collective 
performance. 

Comparing execution times is fair only if you are comparing the best algo- 
rithms on each computer. Comparing the identical code on two computers may 
seem fair, but it is not; the parallel program may be slower on a uniprocessor than 
a sequential version. Developing a parallel program will sometimes lead to algo- 
rithmic improvements, so that comparing the previously best-known sequential 
program with the parallel code—which seems fair—will not compare equivalent 
algorithms. To reflect this issue, the terms relative speedup (same program) and 
true speedup (best program) are sometimes used. 
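
The distinction between the two terms is easy to state in code. The timings below are invented for illustration (they are not measurements from the text), and the function names are hypothetical.

```python
def relative_speedup(t_parallel_code_on_one, t_parallel_code_on_n):
    """Same (parallel) program on 1 processor versus n processors."""
    return t_parallel_code_on_one / t_parallel_code_on_n


def true_speedup(t_best_sequential, t_parallel_code_on_n):
    """Best sequential program versus the parallel program on n processors."""
    return t_best_sequential / t_parallel_code_on_n


# Hypothetical timings in seconds: the parallel code carries overhead
# that makes it slower than the best sequential code on one processor.
t_best_seq = 100.0
t_par_on_1 = 125.0
t_par_on_16 = 10.0

print(relative_speedup(t_par_on_1, t_par_on_16))  # 12.5
print(true_speedup(t_best_seq, t_par_on_16))      # 10.0
```

Under these assumed timings the relative speedup overstates the benefit by 25%, which is why comparisons should state which definition is being used.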

Results that suggest superlinear performance, when a program on n pro- 
cessors is more than n times faster than the equivalent uniprocessor, may indicate 
that the comparison is unfair, although there are instances where "real" superlin- 
ear speedups have been encountered. For example, some scientific applications 
regularly achieve superlinear speedup for small increases in processor count (2 or 
4 to 8 or 16). These results usually arise because critical data structures that do 
not fit into the aggregate caches of a multiprocessor with 2 or 4 processors fit into 
the aggregate cache of a multiprocessor with 8 or 16 processors. 

In summary, comparing performance by comparing speedups is at best tricky 
and at worst misleading. Comparing the speedups for two different multiproces- 
sors does not necessarily tell us anything about the relative performance of the 
multiprocessors. Even comparing two different algorithms on the same multipro- 
cessor is tricky, since we must use true speedup, rather than relative speedup, to 
obtain a valid comparison. 


Fallacy Amdahl's Law doesn't apply to parallel computers. 


In 1987, the head of a research organization claimed that Amdahl's Law (see Sec- 
tion 1.9) had been broken by an MIMD multiprocessor. This statement hardly 
meant, however, that the law has been overturned for parallel computers; the 
neglected portion of the program will still limit performance. To understand the 
basis of the media reports, let's see what Amdahl [1967] originally said: 


A fairly obvious conclusion which can be drawn at this point is that the effort 
expended on achieving high parallel processing rates is wasted unless it is accom- 
panied by achievements in sequential processing rates of very nearly the same 
magnitude. [p. 483] 


One interpretation of the law was that since portions of every program must be 
sequential, there is a limit to the useful economic number of processors—say, 
100. By showing linear speedup with 1000 processors, this interpretation of 
Amdahl's Law was disproved. 

The basis for the statement that Amdahl's Law had been "overcome" was the 
use of scaled speedup. The researchers scaled the benchmark to have a data set 
size that is 1000 times larger and compared the uniprocessor and parallel execu- 
tion times of the scaled benchmark. For this particular algorithm the sequential 
portion of the program was constant independent of the size of the input, and the 
rest was fully parallel; hence, linear speedup with 1000 processors. Because the 
running time grew faster than linear, the program actually ran longer after scaling, 
even with 1000 processors. 

Speedup that assumes scaling of the input is not the same as true speedup, and 
reporting it as if it were is misleading. Since parallel benchmarks are often run on 
different-sized multiprocessors, it is important to specify what type of application 
scaling is permissible and how that scaling should be done. Although simply 
scaling the data size with processor count is rarely appropriate, assuming a fixed 
problem size for a much larger processor count is often inappropriate as well, 
since it is likely that users given a much larger multiprocessor would opt to run a 
larger or more detailed version of an application. In Appendix H, we discuss dif- 
ferent methods for scaling applications for large-scale multiprocessors, introduc- 
ing a model called time-constrained scaling, which scales the application data 
size so that execution time remains constant across a range of processor counts. 
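
The difference between fixed-size speedup under Amdahl's Law and scaled speedup can be made concrete. The sketch below is a simplified model with invented numbers: a 1-second serial portion, 99 seconds of perfectly parallel work, and parallel work that grows linearly with the data set. The linear growth is an assumption made to keep the arithmetic simple; the text notes that the actual running time grew faster than linearly.

```python
def amdahl_speedup(parallel_fraction, n):
    """Fixed-problem-size speedup on n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def scaled_times(serial_time, parallel_time, n, scale):
    """Uniprocessor and n-processor times after growing the data set.

    Assumes the serial part is constant and the parallel work grows
    linearly with the scale factor (a simplification).
    """
    work = parallel_time * scale
    t_uni = serial_time + work
    t_par = serial_time + work / n
    return t_uni, t_par


fixed = amdahl_speedup(parallel_fraction=0.99, n=1000)
t_uni, t_par = scaled_times(serial_time=1.0, parallel_time=99.0, n=1000, scale=1000)

print(round(fixed, 2))          # about 91 on the original problem
print(round(t_uni / t_par, 2))  # about 990 on the scaled problem
print(t_par)                    # 100.0 seconds: longer than the unscaled run
```

In this model the scaled run reports near-linear speedup while taking far longer in wall-clock time than the original problem did on the same machine, which is why the two kinds of speedup should not be conflated.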


Fallacy Linear speedups are needed to make multiprocessors cost-effective. 


It is widely recognized that one of the major benefits of parallel computing is to 
offer a "shorter time to solution" than the fastest uniprocessor. Many people, 
however, also hold the view that parallel processors cannot be as cost-effective as 
uniprocessors unless they can achieve perfect linear speedup. This argument says 
that because the cost of the multiprocessor is a linear function of the number 
of processors, anything less than linear speedup means that the ratio of 
performance/cost decreases, making a parallel processor less cost-effective than 
using a uniprocessor. 

The problem with this argument is that cost is not only a function of proces- 
sor count, but also depends on memory, I/O, and the overhead of the system (box, 
power supply, interconnect, etc.). 
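
A small calculation shows why sub-linear speedup can still be cost-effective once non-processor costs are counted. The cost figures below are arbitrary units invented for the example, memory is held fixed for simplicity, and the function names are hypothetical.

```python
def cost(n_procs, proc_cost, memory_cost, overhead_cost):
    """Total system cost: processors plus memory plus fixed overhead."""
    return n_procs * proc_cost + memory_cost + overhead_cost


def perf_per_cost(speedup, n_procs, base_procs, **costs):
    """Performance/cost of the larger system relative to the base system."""
    cost_ratio = cost(n_procs, **costs) / cost(base_procs, **costs)
    return speedup / cost_ratio


# Hypothetical: going from 4 to 16 processors yields only 3x speedup
# (linear would be 4x), but total cost merely doubles.
ratio = perf_per_cost(speedup=3.0, n_procs=16, base_procs=4,
                      proc_cost=1.0, memory_cost=4.0, overhead_cost=4.0)
print(ratio)  # 1.5
```

A ratio above 1 means the larger configuration delivers more performance per unit cost than the base system despite its sub-linear speedup.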

The effect of including memory in the system cost was pointed out by Wood 
and Hill [1995]. We use an example based on more recent data using TPC-C and 
SPECRate benchmarks, but the argument could also be made with a parallel sci- 
entific application workload, which would likely make the case even stronger. 

Figure 4.35 shows the speedup for TPC-C, SPECintRate and SPECfpRate on 
an IBM eserver p5 multiprocessor configured with 4 to 64 processors. The figure 
shows that only TPC-C achieves better than linear speedup. For SPECintRate and 
SPECfpRate, speedup is less than linear, but so is the cost, since unlike TPC-C 
the amount of main memory and disk required both scale less than linearly. 

As Figure 4.36 shows, larger processor counts can actually be more cost- 
effective than the four-processor configuration. In the future, as the cost of multi- 
ple processors decreases compared to the cost of the support infrastructure (cabi- 
nets, power supplies, fans, etc.), the performance/cost ratio of larger processor 
configurations will improve further. 

In comparing the cost-performance of two computers, we must be sure to 
include accurate assessments of both total system cost and what performance is 
achievable. For many applications with larger memory demands, such a comparison 
can dramatically increase the attractiveness of using a multiprocessor. 

Figure 4.35 Speedup for three benchmarks on an IBM eserver p5 multiprocessor 
when configured with 4, 8, 16, 32, and 64 processors. The dashed line shows linear 
speedup. 


Fallacy Scalability is almost free. 


The goal of scalable parallel computing was a focus of much of the research and 
a significant segment of the high-end multiprocessor development from the mid- 
1980s through the late 1990s. In the first half of that period, it was widely held 
that you could build scalability into a multiprocessor and then simply offer the 
multiprocessor at any point on the scale from a small to large number of proces- 
sors without sacrificing cost-effectiveness. The difficulty with this view is that 
multiprocessors that scale to larger processor counts require substantially more 
investment (in both dollars and design time) in the interprocessor communication 
network, as well as in aspects such as operating system support, reliability, and 
reconfigurability. 

As an example, consider the Cray T3E, which used a 3D torus capable of 
scaling to 2048 processors as an interconnection network. At 128 processors, it 
delivers a peak bisection bandwidth of 38.4 GB/sec, or 300 MB/sec per proces- 
sor. But for smaller configurations, the contemporaneous Compaq AlphaServer 
ES40 could accept up to 4 processors and has 5.6 GB/sec of interconnect bandwidth, 
or more than four times the bandwidth per processor. Furthermore, the cost 
per processor in a Cray T3E is several times higher than the cost in the ES40. 

Scalability is also not free in software: To build software applications that 
scale requires significantly more attention to load balance, locality, potential contention 
for shared resources, and the serial (or partly parallel) portions of the 
program. Obtaining scalability for real applications, as opposed to toys or small 
kernels, across factors of more than five in processor count, is a major challenge. 
In the future, new programming approaches, better compiler technology, and performance 
analysis tools may help with this critical problem, on which little 
progress has been made in 30 years. 

Figure 4.36 The performance/cost relative to a 4-processor system for three benchmarks 
run on an IBM eserver p5 multiprocessor containing from 4 to 64 processors 
shows that the larger processor counts can be as cost-effective as the 4-processor 
configuration. For TPC-C the configurations are those used in the official runs, which 
means that disk and memory scale nearly linearly with processor count, and a 64-processor 
machine is approximately twice as expensive as a 32-processor version. In contrast, 
for the SPECRate runs the disk and memory are scaled more slowly (although still faster 
than necessary to achieve the best SPECRate at 64 processors). In particular the disk 
configurations go from one drive for the 4-processor version to four drives (140 GB) for 
the 64-processor version. Memory is scaled from 8 GB for the 4-processor system to 20 GB 
for the 64-processor system. 


Pitfall Not developing the software to take advantage of, or optimize for, a multiprocessor 
architecture. 


There is a long history of software lagging behind on massively parallel proces- 
sors, possibly because the software problems are much harder. We give one 
example to show the subtlety of the issues, but there are many examples we could 
choose from! 

One frequently encountered problem occurs when software designed for a 
uniprocessor is adapted to a multiprocessor environment. For example, the SGI
operating system originally protected the page table data structure with a single
lock, assuming that page allocation is infrequent. In a uniprocessor, this does 
not represent a performance problem. In a multiprocessor, it can become a 
major performance bottleneck for some programs. Consider a program that 
uses a large number of pages that are initialized at start-up, which UNIX does 
for statically allocated pages. Suppose the program is parallelized so that multi- 
ple processes allocate the pages. Because page allocation requires the use of the 
page table data structure, which is locked whenever it is in use, even an OS ker- 
nel that allows multiple threads in the OS will be serialized if the processes all 
try to allocate their pages at once (which is exactly what we might expect at ini- 
tialization time!). 

This page table serialization eliminates parallelism in initialization and has 
significant impact on overall parallel performance. This performance bottle- 
neck persists even under multiprogramming. For example, suppose we split the 
parallel program apart into separate processes and run them, one process per 
processor, so that there is no sharing between the processes. (This is exactly 
what one user did, since he reasonably believed that the performance problem 
was due to unintended sharing or interference in his application.) Unfortu- 
nately, the lock still serializes all the processes—so even the multiprogramming 
performance is poor. This pitfall indicates the kind of subtle but significant per- 
formance bugs that can arise when software runs on multiprocessors. Like 
many other key software components, the OS algorithms and data structures 
must be rethought in a multiprocessor context. Placing locks on smaller portions
of the page table effectively eliminates the problem. Similar problems
exist in memory structures, and they increase the coherence traffic even in cases
where no sharing is actually occurring.
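
The difference between a single page-table lock and locks on smaller portions of the table can be sketched in a few lines of Python. This is only an illustration of lock granularity, not the SGI implementation: the class names, the `num_stripes` parameter, and the page-to-lock mapping are all hypothetical, and Python threads will not show the real speedup.

```python
import threading

class CoarsePageTable:
    """One lock guards every entry, so all allocations serialize."""
    def __init__(self, num_pages):
        self.lock = threading.Lock()
        self.entries = [None] * num_pages

    def lock_for(self, page):
        return self.lock

    def allocate(self, page, owner):
        with self.lock_for(page):
            self.entries[page] = owner

class StripedPageTable(CoarsePageTable):
    """Hypothetical variant: one lock per group ("stripe") of pages."""
    def __init__(self, num_pages, num_stripes=8):
        super().__init__(num_pages)
        self.locks = [threading.Lock() for _ in range(num_stripes)]

    def lock_for(self, page):
        # Pages in different stripes no longer contend for the same lock.
        return self.locks[page % len(self.locks)]

def parallel_init(table, num_pages, num_workers):
    """Each worker allocates a disjoint set of pages, as at start-up."""
    def work(worker_id):
        for page in range(worker_id, num_pages, num_workers):
            table.allocate(page, worker_id)
    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.entries
```

With the coarse table every worker queues on the same lock even though the workers touch disjoint pages; with the striped table, workers touching different stripes can proceed independently.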


Concluding Remarks 


For more than 30 years, researchers and designers have predicted the end of uniprocessors
and their dominance by multiprocessors. During this time period the
rise of microprocessors and their rapid performance growth have largely confined
the role of multiprocessing to limited market segments. In 2006, we are clearly at
an inflection point where multiprocessors and thread-level parallelism will play a 
greater role across the entire computing spectrum. This change is driven by sev- 
eral phenomena: 


1. The use of parallel processing in some domains is much better understood. 
First among these is the domain of scientific and engineering computation. 
This application domain has an almost limitless thirst for more computation. 
It also has many applications that have lots of natural parallelism. Nonethe- 
less, it has not been easy: Programming parallel processors even for these 
applications remains very challenging, as we discuss further in Appendix H. 


2. The growth in server applications for transaction processing and Web services,
as well as multiprogrammed environments, has been enormous, and
these applications have inherent and more easily exploited parallelism,
through the processing of independent threads.


3. After almost 20 years of breakneck performance improvement, we are in the 
region of diminishing returns for exploiting ILP, at least as we have known it. 
Power issues, complexity, and increasing inefficiency have forced designers to
consider alternative approaches. Exploiting thread-level parallelism is the 
next natural step. 


4. Likewise, for the past 50 years, improvements in clock rate have come from 
improved transistor speed. As we begin to see reductions in such improve- 
ments both from technology limitations and from power consumption, 
exploiting multiprocessor parallelism is increasingly attractive. 


In the 1995 edition of this text, we concluded the chapter with a discussion of 
two then-current controversial issues: 


1. What architecture would very large-scale, microprocessor-based multiproces- 
sors use? 


2. What was the role for multiprocessing in the future of microprocessor archi- 
tecture? 


The intervening years have largely resolved these two questions. 

Because very large-scale multiprocessors did not become a major and grow- 
ing market, the only cost-effective way to build such large-scale multiprocessors 
was to use clusters where the individual nodes are either single microprocessors 
or moderate-scale, shared-memory multiprocessors, which are simply incorpo- 
rated into the design. We discuss the design of clusters and their interconnection 
in Appendices E and H. 

The answer to the second question has become clear only recently, but it has 
become astonishingly clear. The future performance growth in microprocessors, 
at least for the next five years, will almost certainly come from the exploitation of 
thread-level parallelism through multicore processors rather than through exploit- 
ing more ILP. In fact, we are even seeing designers opt to exploit less ILP in 
future processors, instead concentrating their attention and hardware resources 
on more thread-level parallelism. The Sun T1 is a step in this direction, and in
March 2006, Intel announced that its next round of multicore processors would 
be based on a core that is less aggressive in exploiting ILP than the Pentium 4 
Netburst core. The best balance between ILP and TLP will probably depend on a 
variety of factors including the applications mix. 

In the 1980s and 1990s, with the birth and development of ILP, software in 
the form of optimizing compilers that could exploit ILP was key to its success. 
Similarly, the successful exploitation of thread-level parallelism will depend as 
much on the development of suitable software systems as it will on the contributions
of computer architects. Given the slow progress on parallel software in the
past thirty-plus years, it is likely that exploiting thread-level parallelism broadly
will remain challenging for years to come.

4.11 Historical Perspective and References

Section K.5 on the companion CD looks at the history of multiprocessors and 
parallel processing. Divided by both time period and architecture, the section 
includes discussions on early experimental multiprocessors and some of the great 
debates in parallel processing. Recent advances are also covered. References for 
further reading are included. 


Case Studies with Exercises by David A. Wood


Case Study 1: Simple, Bus-Based Multiprocessor 


Concepts illustrated by this case study 


• Snooping Coherence Protocol Transitions
• Coherence Protocol Performance
• Coherence Protocol Optimizations
• Synchronization


The simple, bus-based multiprocessor illustrated in Figure 4.37 represents a com- 
monly implemented symmetric shared-memory architecture. Each processor has 
a single, private cache with coherence maintained using the snooping coherence 
protocol of Figure 4.7. Each cache is direct-mapped, with four blocks each hold- 
ing two words. To simplify the illustration, the cache-address tag contains the full 
address and each word shows only two hex characters, with the least significant 
word on the right. The coherence states are denoted M, S, and I for Modified, 
Shared, and Invalid. 


4.1 [10/10/10/10/10/10/10] <4.2> For each part of this exercise, assume the initial 
cache and memory state as illustrated in Figure 4.37. Each part of this exercise 
specifies a sequence of one or more CPU operations of the form: 


P#: <op> <address> [ <-- <value> ]


where P# designates the CPU (e.g., P0), <op> is the CPU operation (e.g., read or
write), <address> denotes the memory address, and <value> indicates the new
word to be assigned on a write operation.


Treat each action below as independently applied to the initial state as given in 
Figure 4.37. What is the resulting state (i.e., coherence state, tags, and data) of 
the caches and memory after the given action? Show only the blocks that change, 
for example, P0.B0: (I, 120, 00 01) indicates that CPU P0's block B0 has the
final state of I, tag of 120, and data words 00 and 01. Also, what value is returned 
by each read operation? 


a. [10] <4.2> P0: read 120
b. [10] <4.2> P0: write 120 <-- 80
c. [10] <4.2> P15: write 120 <-- 80
d. [10] <4.2> P1: read 110
e. [10] <4.2> P0: write 108 <-- 48
f. [10] <4.2> P0: write 130 <-- 78
g. [10] <4.2> P15: write 130 <-- 78


4.2 [20/20/20/20] <4.3> The performance of a snooping cache-coherent multiprocessor
depends on many detailed implementation issues that determine how quickly
a cache responds with data in an exclusive or M state block. In some implementations,
a CPU read miss to a cache block that is exclusive in another processor's
cache is faster than a miss to a block in memory. This is because caches are
smaller, and thus faster, than main memory. Conversely, in some implementations,
misses satisfied by memory are faster than those satisfied by caches. This is
because caches are generally optimized for "front side" or CPU references, rather
than "back side" or snooping accesses.


For the multiprocessor illustrated in Figure 4.37, consider the execution of a 
sequence of operations on a single CPU where 


• CPU read and write hits generate no stall cycles.

• CPU read and write misses generate N_memory and N_cache stall cycles if satisfied
by memory and cache, respectively.

• CPU write hits that generate an invalidate incur N_invalidate stall cycles.

• A writeback of a block, either due to a conflict or another processor's request
to an exclusive block, incurs an additional N_writeback stall cycles.


Consider two implementations with different performance characteristics sum- 
marized in Figure 4.38. 


Consider the following sequence of operations assuming the initial cache state in 
Figure 4.37. For simplicity, assume that the second operation begins after the first 
completes (even though they are on different processors): 


P1: read 110
P15: read 110


For Implementation 1, the first read generates 80 stall cycles because the read is
satisfied by P0's cache: P1 stalls for 70 cycles while it waits for the block, and P0
stalls for 10 cycles while it writes the block back to memory in response to P1's
request. The second read by P15 then generates 100 stall cycles because its miss
is satisfied by memory. Thus this sequence generates a total of 180 stall cycles.
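
The arithmetic of this worked example can be checked with a short Python sketch. The latencies are copied from Figure 4.38; the event labels and the helper `total_stalls` are hypothetical names for this illustration, and deciding which events each operation triggers is still left to the reader for the exercises that follow.

```python
# Stall-cycle parameters from Figure 4.38, in cycles.
LATENCIES = {
    "Implementation 1": {"memory": 100, "cache": 70, "invalidate": 15, "writeback": 10},
    "Implementation 2": {"memory": 100, "cache": 130, "invalidate": 15, "writeback": 10},
}

def total_stalls(events, params):
    """Sum the stall cycles of a list of coherence events."""
    return sum(params[event] for event in events)

# P1: read 110  -> miss satisfied by P0's cache, plus P0's writeback.
# P15: read 110 -> miss satisfied by memory.
worked_example = ["cache", "writeback", "memory"]

print(total_stalls(worked_example, LATENCIES["Implementation 1"]))  # 70 + 10 + 100
```

For Implementation 1 this prints 180, matching the total above.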


For the following sequences of operations, how many stall cycles are generated 
by each implementation? 


a. [20] <4.3> P0: read 120
              P0: read 128
              P0: read 130
b. [20] <4.3> P0: read 100
              P0: write 108 <-- 48
              P0: write 130 <-- 78
c. [20] <4.3> P1: read 120
              P1: read 128
              P1: read 130
d. [20] <4.3> P1: read 100
              P1: write 108 <-- 48
              P1: write 130 <-- 78

















Parameter       Implementation 1    Implementation 2
N_memory        100                 100
N_cache         70                  130
N_invalidate    15                  15
N_writeback     10                  10

Figure 4.38 Snooping coherence latencies.


4.3 [20] <4.2> Many snooping coherence protocols have additional states, state transitions,
or bus transactions to reduce the overhead of maintaining cache coherency.
In Implementation 1 of Exercise 4.2, misses incur fewer stall cycles
when they are supplied by cache than when they are supplied by memory. Some
coherence protocols try to improve performance by increasing the frequency of
this case.


A common protocol optimization is to introduce an Owned state (usually denoted 
O). The Owned state behaves like the Shared state, in that nodes may only read 
Owned blocks. But it behaves like the Modified state, in that nodes must supply 
data on other nodes’ read and write misses to Owned blocks. A read miss to a 
block in either the Modified or Owned states supplies data to the requesting node 
and transitions to the Owned state. A write miss to a block in either state Modi- 
fied or Owned supplies data to the requesting node and transitions to state 
Invalid. This optimized MOSI protocol only updates memory when a node 
replaces a block in state Modified or Owned. 
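
The snoop-side behavior just described can be written down as a small Python function. This is a sketch under stated assumptions, not the full protocol: the M and O rows follow the paragraph above, the S and I rows are assumed to behave as in the base MSI protocol, and the function name `snoop_response` is hypothetical. CPU-side transitions are not covered.

```python
def snoop_response(state, remote_miss):
    """React to another node's miss on a block this cache may hold.

    state is one of "M", "O", "S", "I"; remote_miss is "read" or "write".
    Returns (supplies_data, next_state).
    """
    if remote_miss not in ("read", "write"):
        raise ValueError("remote_miss must be 'read' or 'write'")
    if state in ("M", "O"):
        # The owner supplies the data; memory is not updated here.
        return (True, "O" if remote_miss == "read" else "I")
    if state == "S":
        # Assumed base-MSI behavior: a sharer never supplies data.
        return (False, "S" if remote_miss == "read" else "I")
    if state == "I":
        return (False, "I")
    raise ValueError("unknown state: " + state)
```

Note that a block in M that sees a remote read moves to O rather than S, so the same cache keeps supplying data on later misses.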


Draw new protocol diagrams with the additional state and transitions. 


4.4 [20/20/20/20] <4.2> For the following code sequences and the timing parameters
for the two implementations in Figure 4.38, compute the total stall cycles for the
base MSI protocol and the optimized MOSI protocol in Exercise 4.3. Assume
state transitions that do not require bus transactions incur no additional stall
cycles.


a. [20] <4.2> P1: read 110
              P15: read 110
              P0: read 110
b. [20] <4.2> P1: read 120
              P15: read 120
              P0: read 120
c. [20] <4.2> P0: write 120 <-- 80
              P15: read 120
              P0: read 120
d. [20] <4.2> P0: write 108 <-- 88
              P15: read 108
              P0: write 108 <-- 98


4.5 [20] <4.2> Some applications read a large data set first, then modify most or all
of it. The base MSI coherence protocol will first fetch all of the cache blocks in
the Shared state, and then be forced to perform an invalidate operation to upgrade
them to the Modified state. The additional delay has a significant impact on some
workloads.

An additional protocol optimization eliminates the need to upgrade blocks that
are read and later written by a single processor. This optimization adds the Exclusive
(E) state to the protocol, indicating that no other node has a copy of the
block, but it has not yet been modified. A cache block enters the Exclusive state
when a read miss is satisfied by memory and no other node has a valid copy. CPU
reads and writes to that block proceed with no further bus traffic, but CPU writes
cause the coherence state to transition to Modified. Exclusive differs from Modified
because the node may silently replace Exclusive blocks (while Modified
blocks must be written back to memory). Also, a read miss to an Exclusive block
results in a transition to Shared, but does not require the node to respond with
data (since memory has an up-to-date copy).
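
The rules for the Exclusive state stated above can be collected into a few Python helpers. This is a partial sketch, not a complete MESI specification: it encodes only the behavior the text describes (plus the Shared-to-Modified upgrade of the base protocol), and all function names are hypothetical.

```python
def fill_state_on_read_miss(other_node_has_valid_copy):
    """State a block enters when a read miss is satisfied."""
    return "S" if other_node_has_valid_copy else "E"

def cpu_write_hit(state):
    """Returns (needs_bus_invalidate, next_state) for a write to a valid block."""
    if state == "S":
        return (True, "M")    # base-protocol upgrade: invalidate other copies
    if state in ("E", "M"):
        return (False, "M")   # no other copies exist, so no bus traffic
    raise ValueError("write hit requires a valid block")

def replacement_needs_writeback(state):
    """Only Modified blocks hold data that memory lacks."""
    return state == "M"

def remote_read_miss_on_exclusive():
    """Returns (supplies_data, next_state): memory supplies the data."""
    return (False, "S")
```

For a block that one processor reads and later writes, `fill_state_on_read_miss(False)` gives "E" and `cpu_write_hit("E")` then needs no invalidate, which is the upgrade the optimization removes.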


Draw new protocol diagrams for a MESI protocol that adds the Exclusive state 
and transitions to the base MSI protocol's Modified, Shared, and Invalid
states. 


4.6 [20/20/20/20/20] <4.2> Assume the cache contents of Figure 4.37 and the timing
of Implementation 1 in Figure 4.38. What are the total stall cycles for the following
code sequences with both the base protocol and the new MESI protocol in
Exercise 4.5? Assume state transitions that do not require bus transactions incur
no additional stall cycles.


a. [20] <4.2> P0: read 100
              P0: write 100 <-- 40
b. [20] <4.2> P0: read 120
              P0: write 120 <-- 60
c. [20] <4.2> P0: read 100
              P0: read 120
d. [20] <4.2> P0: read 100
              P1: write 100 <-- 60
e. [20] <4.2> P0: read 100
              P0: write 100 <-- 60
              P1: write 100 <-- 40


4.7 [20/20/20/20] <4.5> The test-and-set spin lock is the simplest synchronization
mechanism possible on most commercial shared-memory machines. This spin 
lock relies on the exchange primitive to atomically load the old value and store a 
new value. The lock routine performs the exchange operation repeatedly until it 
finds the lock unlocked (i.e., the returned value is 0). 


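tas:    DADDUI R2,R0,#1     ; R2 = 1, the "locked" value
lockit: EXCH   R2,0(R1)     ; atomically swap R2 with the lock word
        BNEZ   R2,lockit    ; old value nonzero: lock was held, retry


Unlocking a spin lock simply requires a store of the value 0.


unlock: SW     R0,0(R1)     ; store 0 to release the lock


As discussed in Section 4.7, the more optimized test-and-test-and-set lock uses a
load to check the lock, allowing it to spin with a shared variable in the cache.


tatas:  LD     R2,0(R1)     ; read the lock word (spins on the cached copy)
        BNEZ   R2,tatas     ; still held: keep testing without an exchange
        DADDUI R2,R0,#1     ; R2 = 1, the "locked" value
        EXCH   R2,0(R1)     ; lock looked free: try to take it atomically
        BNEZ   R2,tatas     ; another processor won the race: start over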
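
The control flow of the two locks can also be sketched in Python. This is a functional model only: the `Word` class emulates the atomic exchange with a hidden mutex, which is an assumption of the sketch (real hardware provides EXCH directly), and Python threads say nothing about bus traffic or stall cycles.

```python
import threading
import time

class Word:
    """A memory word whose loads are plain and whose stores/exchanges are atomic."""
    def __init__(self):
        self.value = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def exchange(self, new):
        with self._atomic:
            old, self.value = self.value, new
            return old

    def store(self, new):
        with self._atomic:
            self.value = new

def tas_acquire(lock):
    # Test-and-set: exchange repeatedly until the old value is 0 (unlocked).
    while lock.exchange(1) != 0:
        time.sleep(0)                     # yield so the holder can run

def tatas_acquire(lock):
    # Test-and-test-and-set: spin on a plain load, exchange only when it reads 0.
    while True:
        while lock.value != 0:
            time.sleep(0)
        if lock.exchange(1) == 0:
            return

def release(lock):
    lock.store(0)                         # unlocking is a store of 0

def run(acquire, workers=4, rounds=200):
    """Increment a shared counter under the lock from several threads."""
    lock, counter = Word(), [0]
    def work():
        for _ in range(rounds):
            acquire(lock)
            counter[0] += 1               # critical section
            release(lock)
    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Both variants give mutual exclusion; the difference the exercise explores is that `tatas_acquire` performs the exchange only after a load has seen the lock free, so waiting processors spin on reads instead of writes.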


Assume that processors P0, P1, and P15 are all trying to acquire a lock at address
0x100 (i.e., register R1 holds the value 0x100). Assume the cache contents from
Figure 4.37 and the timing parameters from Implementation 1 in Figure 4.38. For
simplicity, assume the critical sections are 1000 cycles long.


a. [20] <4.5> Using the test-and-set spin lock, determine approximately how 
many memory stall cycles each processor incurs before acquiring the lock. 


b. [20] <4.5> Using the test-and-test-and-set spin lock, determine approxi- 
mately how many memory stall cycles each processor incurs before acquiring 
the lock. 


c. [20] <4.5> Using the test-and-set spin lock, approximately how many bus
transactions occur? 





d. [20] <4.5> Using the test-and-test-and-set spin lock, approximately how 
many bus transactions occur? 


Case Study 2: A Snooping Protocol for a Switched Network 


Concepts illustrated by this case study 


• Snooping Coherence Protocol Implementation
• Coherence Protocol Performance
• Coherence Protocol Optimizations
• Memory Consistency Models


The snooping coherence protocols in Case Study 1 describe coherence at an 
abstract level, but hide many essential details and implicitly assume atomic 
access to the shared bus to provide correct operation. High-performance snoop- 
ing systems use one or more pipelined, switched interconnects that greatly 
improve bandwidth but introduce significant complexity due to transient states 
and nonatomic transactions. This case study examines a high-performance 
snooping system, loosely modeled on the Sun E6800, where multiple processor 
and memory nodes are connected by separate switched address and data net- 
works. 

Figure 4.39 Snooping system with switched interconnect.

Figure 4.39 illustrates the system organization (middle) with enlargements of
a single processor node (left) and a memory module (right). Like most high-performance
shared-memory systems, this system provides multiple memory
modules to increase memory bandwidth. The processor nodes contain a CPU,
cache, and a cache controller that implements the coherence protocol. The CPU
issues read and write requests to the cache controller over the REQUEST bus and
sends/receives data over the DATA bus. The cache controller services these
requests locally (i.e., on cache hits) and on a miss issues a coherence request
(e.g., GetShared to request a read-only copy, GetModified to get an exclusive 
copy) by sending it to the address network via the ADDR_OUT queue. The 
address network uses a broadcast tree to make sure that all nodes see all coher- 
ence requests in a total order. All nodes, including the requesting node, receive 
this request in the same order (but not necessarily the same cycle) on the 
ADDR_IN queue. This total order is essential to ensure that all cache controllers 
act in concert to maintain coherency. 

The protocol ensures that at most one node responds, sending a data message 
on the separate, unordered point-to-point data network. 

Figure 4.40 presents a (simplified) coherence protocol for this system in tabu- 
lar form. Tables are commonly used to specify coherence protocols since the 
multitude of states makes state diagrams too ungainly. Each row corresponds to a 
block's coherence state, each column represents an event (e.g., a message arrival 
or processor operation) affecting that block, and each table entry indicates the 
action and next state (if any). Note that there are two types of coherence
states. The stable states are the familiar Modified (M), Shared (S), or Invalid (I)
and are stored in the cache. Transient states arise because of nonatomic transitions
between stable coherence states. An important source of this nonatomicity
is races within the pipelined address network and between the
address and data networks. For example, two cache controllers may send request
messages in the same cycle for the same block, but may not find out for several
cycles how the tie is broken (this is done by monitoring the ADDR_IN queue, to
see in which order the requests arrive). Cache controllers use transient states to
remember what has transpired in the past while they wait for other actions to
occur in the future. Transient states are typically stored in an auxiliary structure
such as an MSHR, rather than the cache itself. In this protocol, transient state
names encode their initial state, their intended state, and a superscript indicating
which messages are still outstanding. For example, the state IS^A indicates that the
block was in state I, wants to become state S, but needs to see its own request
message (i.e., GetShared) arrive on the ADDR_IN queue before making the transition.

Figure 4.40 Broadcast snooping cache controller transitions.

Events at the cache controller depend on CPU requests and incoming request 
and data messages. The OwnReq event means that a CPU's own request has 
arrived on the ADDR_IN queue. The Replacement event is a pseudo-CPU event 
generated when a CPU read or write triggers a cache replacement. Cache control- 
ler behavior is detailed in Figure 4.40, where each entry contains an <action/next 
state> tuple. When the current state of a block corresponds to the row of the entry 
and the next event corresponds to the column of the entry, then the specified 
action is performed and the state of the block is changed to the specified new 
state. If only a next state is listed, then no action is required. If no new state is 
listed, the state remains unchanged. Impossible cases are marked "error" and
represent error conditions, "z" means the requested event cannot currently be
processed, and "—" means no action or state change is required.

The following example illustrates the basic operation of this protocol.
Assume that P0 attempts a read to a block that is in state I (Invalid) in all caches.
The cache controller's action (determined by the table entry that corresponds to
state I and event "read") is "send GetS/IS^AD," which means that the cache controller
should issue a GetS (i.e., GetShared) request to the address network and
transition to transient state IS^AD to wait for the address and data messages. In the
absence of contention, P0's cache controller will normally receive its own GetS
message first, indicated by the OwnReq column, causing a transition to state IS^D.
Other cache controllers will handle this request as "Other GetS" in state I. When
the memory controller sees the request on its ADDR_IN queue, it reads the block
from memory and sends a data message to P0. When the data message arrives at
P0's DATA_IN queue, indicated by the Data column, the cache controller saves
the block in the cache, performs the read, and sets the state to S (i.e., Shared).
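
The uncontended read just walked through can be expressed as a tiny table-driven state machine in Python. The sketch assumes the transient-state superscripts read AD (address and data outstanding) and D (data outstanding), written here as IS^AD and IS^D, and it contains only the three table entries mentioned in this example; Figure 4.40 has many more rows and columns. The names `TRANSITIONS`, `step`, and `trace` are hypothetical.

```python
# (current state, event) -> (action, next state); a subset of Figure 4.40.
TRANSITIONS = {
    ("I", "read"): ("send GetS", "IS^AD"),
    ("IS^AD", "OwnReq"): (None, "IS^D"),
    ("IS^D", "Data"): ("save block, perform read", "S"),
}

def step(state, event):
    """Look up one table entry and return (action, next_state)."""
    return TRANSITIONS[(state, event)]

def trace(state, events):
    """Return the list of states visited while handling the events in order."""
    visited = [state]
    for event in events:
        _, state = step(state, event)
        visited.append(state)
    return visited

print(trace("I", ["read", "OwnReq", "Data"]))
```

Keeping the protocol as a lookup table mirrors the tabular specification style the text describes, and an event with no entry raises a `KeyError` instead of silently doing nothing.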

A somewhat more complex case arises if node P1 holds the block in state M.
In this case, P1's action for "Other GetS" causes it to send the data both to P0 and
to memory, and then transition to state S. P0 behaves exactly as before, but the
memory must maintain enough logic or state to (1) not respond to P0's request
(because P1 will respond) and (2) wait to respond to any future requests for this
block until it receives the data from P1. This requires the memory controller to
implement its own transient states (not shown). Exercise 4.11 explores alternative
ways to implement this functionality.

More complex transitions occur when other requests intervene or cause
address and data messages to arrive out of order. For example, suppose the cache
controller in node P0 initiates a writeback of a block in state Modified. As Figure
4.40 shows, the controller does this by issuing a PutModified coherence request
to the ADDR_OUT queue. Because of the pipelined nature of the address network,
node P0 cannot send the data until it sees its own request on the ADDR_IN
queue and determines its place in the total order. This creates an interval, called a
window of vulnerability, where another node's request may change the action that
should be taken by a cache controller. For example, suppose that node P1 has
issued a GetModified request (i.e., requesting an exclusive copy) for the same
block that arrives during P0's window of vulnerability for the PutModified
request. In this case, P1's GetModified request logically occurs before P0's PutModified
request, making it incorrect for P0 to complete the writeback. P0's
cache controller must respond to P1's GetModified request by sending the block
to P1 and invalidating its copy. However, P0's PutModified request remains pending
in the address network, and both P0 and P1 must ignore the request when it
eventually arrives (node P0 ignores the request since its copy has already been
invalidated; node P1 ignores the request since the PutModified was sent by a different
node).


4.8 [10/10/10/10/10/10/10] <4.2> Consider the switched network snooping protocol
described above and the cache contents from Figure 4.37. What is the sequence
of transient states that the affected cache blocks move through in each of the following
cases for each of the affected caches? Assume that the address network
latency is much less than the data network latency.


a. [10] <4.2> P0: read 120
b. [10] <4.2> P0: write 120 <-- 80
c. [10] <4.2> P15: write 120 <-- 80
d. [10] <4.2> P1: read 110
e. [10] <4.2> P0: write 108 <-- 48
f. [10] <4.2> P0: write 130 <-- 78
g. [10] <4.2> P15: write 130 <-- 78


4.9 [15/15/15/15/15/15/15] <4.2> Consider the switched network snooping protocol
described above and the cache contents from Figure 4.37. What is the sequence
of transient states that the affected cache blocks move through in each of the following
cases? In all cases, assume that the processors issue their requests in the
same cycle, but the address network orders the requests in top-down order. Also
assume that the data network is much slower than the address network, so that the
first data response arrives after all address messages have been seen by all nodes.


a. [15] <4.2> P0: read 120
              P1: read 120
b. [15] <4.2> P0: read 120
              P1: write 120 <-- 80
c. [15] <4.2> P0: write 120 <-- 80
              P1: read 120
d. [15] <4.2> P0: write 120 <-- 80
              P1: write 120 <-- 90
e. [15] <4.2> P0: replace 110
              P1: read 110
f. [15] <4.2> P1: write 110 <-- 80
              P0: replace 110
g. [15] <4.2> P1: read 110
              P0: replace 110


4.10 [20/20/20/20/20/20/20] <4.2, 4.3> The switched interconnect increases the performance
of a snooping cache-coherent multiprocessor by allowing multiple
requests to be overlapped. Because the controllers and the networks are pipelined,
there is a difference between an operation's latency (i.e., cycles to complete
the operation) and overhead (i.e., cycles until the next operation can begin).
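
The distinction between latency and overhead can be made concrete with a short Python sketch. It assumes an idealized pipelined unit that can start a new operation every `overhead` cycles and that back-to-back operations do not otherwise interfere; the function name `completion_times` is hypothetical.

```python
def completion_times(n_ops, latency, overhead):
    """Cycle at which each of n_ops identical back-to-back operations finishes.

    Operation i starts at cycle i * overhead and finishes latency cycles later.
    """
    return [i * overhead + latency for i in range(n_ops)]

# Example with a unit that has latency 100 and overhead 20.
print(completion_times(3, latency=100, overhead=20))
```

With latency 100 and overhead 20, three operations finish at cycles 100, 120, and 140 under these assumptions, rather than at 100, 200, and 300 if each had to wait for the previous one to complete.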


For the multiprocessor illustrated in Figure 4.39, assume the following latencies
and overheads:


• CPU read and write hits generate no stall cycles.

• A CPU read or write that generates a replacement event issues the corresponding
GetShared or GetModified message before the PutModified
message (e.g., using a writeback buffer).

• A cache controller event that sends a request message (e.g., GetShared)
has latency L_send_req and blocks the controller from processing other events
for O_send_req cycles.

• A cache controller event that reads the cache and sends a data message has
latency L_send_data and overhead O_send_data cycles.

• A cache controller event that receives a data message and updates the
cache has latency L_rcv_data and overhead O_rcv_data cycles.

• A memory controller has latency L_read_memory and overhead O_read_memory
cycles to read memory and send a data message.

• A memory controller has latency L_write_memory and overhead O_write_memory
cycles to write a data message to memory.

• In the absence of contention, a request message has network latency
L_req_msg and overhead O_req_msg cycles.

• In the absence of contention, a data message has network latency L_data_msg
and overhead O_data_msg cycles.


Consider an implementation with the performance characteristics summarized in 
Figure 4.41. 


                Implementation 1
Action          Latency    Overhead
send_req        4          1
send_data       20         4
rcv_data        15         4
read_memory     100        20
write_memory    100        20
req_msg         8
data_msg        30         5

Figure 4.41 Switched snooping coherence latencies and overheads.


For the following sequences of operations and the cache contents from
Figure 4.37 and the implementation parameters in Figure 4.41, how many stall
cycles does each processor incur for each memory request? Similarly, for how
many cycles are the different controllers occupied? For simplicity, assume (1)
each processor can have only one memory operation outstanding at a time, (2) if
two nodes make requests in the same cycle and the one listed first "wins," the
later node must stall for the request message overhead, and (3) all requests map
to the same memory controller.


a. [20] <4.2, 4.3> P0: read 120
b. [20] <4.2, 4.3> P0: write 120 <-- 80
c. [20] <4.2, 4.3> P15: write 120 <-- 80
d. [20] <4.2, 4.3> P1: read 110
e. [20] <4.2, 4.3> P0: read 120
                   P15: read 128
f. [20] <4.2, 4.3> P0: read 100
                   P1: write 110 <-- 78
g. [20] <4.2, 4.3> P0: write 100 <-- 28
                   P1: write 100 <-- 48


4.11 [25/25] <4.2, 4.4> The switched snooping protocol of Figure 4.40 assumes that
memory "knows" whether a processor node is in state Modified and thus will
respond with data. Real systems implement this in one of two ways. The first way
uses a shared "Owned" signal. Processors assert Owned if an "Other GetS" or
"Other GetM" event finds the block in state M. A special network ORs the individual
Owned signals together; if any processor asserts Owned, the memory controller
ignores the request. Note that in a nonpipelined interconnect, this special
network is trivial (i.e., it is an OR gate).

However, this network becomes much more complicated with high-performance
pipelined interconnects. The second alternative adds a simple directory to the
memory controller (e.g., 1 or 2 bits) that tracks whether the memory controller is
responsible for responding with data or whether a processor node is responsible
for doing so.


a. [25] <4.2, 4.4> Use a table to specify the memory controller protocol needed 
to implement the second alternative. For this problem, ignore the PUTM mes- 
sage that gets sent on a cache replacement. 


b. [25] <4.2, 4.4> Explain what the memory controller must do to support the 
following sequence, assuming the initial cache contents of Figure 4.37: 


P1: read 110
P15: read 110 


4.12 [30] <4.2> Exercise 4.3 asks you to add the Owned state to the simple MSI
snooping protocol. Repeat the question, but with the switched snooping protocol
above.


4.13 [30] <4.2> Exercise 4.5 asks you to add the Exclusive state to the simple MSI
snooping protocol. Discuss why this is much more difficult to do with the
switched snooping protocol. Give an example of the kinds of issues that arise.


4.14 [20/20/20/20] <4.6> Sequential consistency (SC) requires that all reads and
writes appear to have executed in some total order. This may require the processor
to stall in certain cases before committing a read or write instruction. Consider
the following code sequence:


write A 
read B 


where the write A results in a cache miss and the read B results in a cache hit.
Under SC, the processor must stall read B until after it can order (and thus per- 
form) write A. Simple implementations of SC will stall the processor until the 
cache receives the data and can perform the write. 


Weaker consistency models relax the ordering constraints on reads and writes, 
reducing the cases that the processor must stall. The Total Store Order (TSO) 
consistency model requires that all writes appear to occur in a total order, but 
allows a processor's reads to pass its own writes. This allows processors to
implement write buffers, which hold committed writes that have not yet been
ordered with respect to other processors' writes. Reads are allowed to pass (and
potentially bypass) the write buffer in TSO (which they could not do under SC).
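
The contrast between the two models for the write A / read B sequence can be sketched with a toy in-order model. This is only an illustration under stated assumptions: the function name `stalls_before_ops`, the 100-cycle miss latency, and the unbounded write buffer are hypothetical, and the real latencies are those of Figure 4.41.

```python
def stalls_before_ops(ops, model, miss_latency):
    """Return the stall cycles charged before each operation.

    ops is a list of (kind, hit) pairs, where kind is "read" or "write".
    Assumptions of this sketch: an in-order core, one operation per cycle,
    and (for TSO) a write buffer that never fills.
    """
    stalls = []
    blocking = 0  # cycles the previous operation forces the next one to wait
    for kind, hit in ops:
        stalls.append(blocking)
        latency = 0 if hit else miss_latency
        if model == "SC":
            # SC: the next operation waits until this one has been performed.
            blocking = latency
        elif model == "TSO":
            # TSO: a write miss waits in the write buffer, so later
            # operations may pass it; a read miss still blocks the core.
            blocking = latency if kind == "read" else 0
        else:
            raise ValueError("model must be 'SC' or 'TSO'")
    return stalls


# write A misses, read B hits (the sequence discussed above)
sequence = [("write", False), ("read", True)]
sc_stalls = stalls_before_ops(sequence, "SC", miss_latency=100)
tso_stalls = stalls_before_ops(sequence, "TSO", miss_latency=100)
```

Under these assumptions the read stalls for the full miss latency in SC and not at all in TSO; a finite write buffer or a read that itself misses would change the result.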


Assume that one memory operation can be performed per cycle and that opera- 
tions that hit in the cache or that can be satisfied by the write buffer introduce no 
stall cycles. Operations that miss incur the latencies listed in Figure 4.41. Assume 
the cache contents of Figure 4.37 and the base switched protocol of Exercise 4.8. 
How many stall cycles occur prior to each operation for both the SC and TSO 
consistency models? 


a. [20] <4.6> P0: write 110 <-- 80
              P0: read 108
b. [20] <4.6> P0: write 100 <-- 80
              P0: read 108
c. [20] <4.6> P0: write 110 <-- 80
              P0: write 100 <-- 90
d. [20] <4.6> P0: write 100 <-- 80
              P0: write 110 <-- 90


[20/20] <4.6> The switched snooping protocol above supports sequential
consistency in part by making sure that reads are not performed while another
node has a writeable block and writes are not performed while another processor
has a writeable block. A more aggressive protocol will actually perform a write
operation as soon as it receives its own GetModified request, merging the newly
written word(s) with the rest of the block when the data message arrives. This
may appear illegal, since another node could simultaneously be writing the
block. However, the global order required by sequential consistency is
determined by the order of coherence requests on the address network, so the
other node's write(s) will be ordered before the requester's write(s). Note that
this optimization does not change the memory consistency model.

Assuming the parameters in Figure 4.41:
a. [20] <4.6> How significant would this optimization be for an in-order core? 


b. [20] <4.6> How significant would this optimization be for an out-of-order 
core? 


Case Study 3: Simple Directory-Based Coherence 


Concepts illustrated by this case study 


■ Directory Coherence Protocol Transitions
■ Coherence Protocol Performance
■ Coherence Protocol Optimizations


Consider the distributed shared-memory system illustrated in Figure 4.42. Each 
processor has a single direct-mapped cache that holds four blocks each holding two 
words. To simplify the illustration, the cache address tag contains the full address 
and each word shows only two hex characters, with the least significant word on 
the right. The cache states are denoted M, S, and I for Modified, Shared, and 
Invalid. The directory states are denoted DM, DS, and DI for Directory Modified,
Directory Shared, and Directory Invalid. The simple directory protocol is described
in Figures 4.21 and 4.22.


[Figure: caches connected through a switched network with point-to-point order
to a memory whose directory has the columns Address, State, Owner/sharers, and
Data for block addresses 100, 108, 110, 118, 120, 128, and 130.]

Figure 4.42 Multiprocessor with directory cache coherence.


[10/10/10/10/15/15/15/15] <4.4> For each part of this exercise, assume the initial
cache and memory state in Figure 4.42. Each part of this exercise specifies a
sequence of one or more CPU operations of the form:


P# <op> <address> [ <-- <value> ]


where P# designates the CPU (e.g., P0), <op> is the CPU operation (e.g., read or
write), <address> denotes the memory address, and <value> indicates the new
word to be assigned on a write operation.


What is the final state (i.e., coherence state, tags, and data) of the caches and
memory after the given sequence of CPU operations has completed? Also, what
value is returned by each read operation?


a. [10] <4.4> P0: read 100
b. [10] <4.4> P0: read 128
c. [10] <4.4> P0: write 128 <-- 78
d. [10] <4.4> P0: read 120
e. [15] <4.4> P0: read 120
              P1: read 120
f. [15] <4.4> P0: read 120
              P1: write 120 <-- 80
g. [15] <4.4> P0: write 120 <-- 80
              P1: read 120
h. [15] <4.4> P0: write 120 <-- 80
              P1: write 120 <-- 90


[10/10/10/10] <4.4> Directory protocols are more scalable than snooping proto- 
cols because they send explicit request and invalidate messages to those nodes 
that have copies of a block, while snooping protocols broadcast all requests and 
invalidates to all nodes. Consider the 16-processor system illustrated in Figure 
4.42 and assume that all caches not shown have invalid blocks. For each of the 
sequences below, identify which nodes receive each request and invalidate. 


a. [10] <4.4> P0: write 100 <-- 80
b. [10] <4.4> P0: write 108 <-- 88
c. [10] <4.4> P0: write 118 <-- 90
d. [10] <4.4> P1: write 128 <-- 98


[25] <4.4> Exercise 4.3 asks you to add the Owned state to the simple MSI
snooping protocol. Repeat the question, but with the simple directory protocol
above.


[25] <4.4> Exercise 4.5 asks you to add the Exclusive state to the simple MSI
snooping protocol. Discuss why this is much more difficult to do with the simple 
directory protocol. Give an example of the kinds of issues that arise. 


Case Study 4: Advanced Directory Protocol 


Concepts illustrated by this case study 


■ Directory Coherence Protocol Implementation
■ Coherence Protocol Performance
■ Coherence Protocol Optimizations


The directory coherence protocol in Case Study 3 describes directory coherence 
at an abstract level, but assumes atomic transitions much like the simple snooping 
system. High-performance directory systems use pipelined, switched intercon- 
nects that greatly improve bandwidth but also introduce transient states and non- 
atomic transactions. Directory cache coherence protocols are more scalable than 
snooping cache coherence protocols for two reasons. First, snooping cache 
coherence protocols broadcast requests to all nodes, limiting their scalability. 
Directory protocols use a level of indirection—a message to the directory—to 
ensure that requests are only sent to the nodes that have copies of a block. Sec- 
ond, the address network of a snooping system must deliver requests in a total 
order, while directory protocols can relax this constraint. Some directory proto- 
cols assume no network ordering, which is beneficial since it allows adaptive 
routing techniques to improve network bandwidth. Other protocols rely on
point-to-point order (i.e., messages from node P0 to node P1 will arrive in order). Even
with this ordering constraint, directory protocols usually have more transient 
states than snooping protocols. Figure 4.43 presents the cache controller state 
transitions for a simplified directory protocol that relies on point-to-point net- 
work ordering. Figure 4.44 presents the directory controller's state transitions. 
For each block, the directory maintains a state and a current owner field or a cur- 
rent sharers list (if any). 

Like the high-performance snooping protocol presented earlier, indexing the
row by the current state and the column by the event determines the <action/next
state> tuple. If only a next state is listed, then no action is required. Impossible
cases are marked "error" and represent error conditions; "z" means the requested
event cannot currently be processed.

The following example illustrates the basic operation of this protocol. Suppose
a processor attempts a write to a block in state I (Invalid). The corresponding
tuple is "send GetM/IM^AD," indicating that the cache controller should send a
GetM (GetModified) request to the directory and transition to state IM^AD. In the
simplest case, the request message finds the directory in state DI (Directory
Invalid), indicating that no other cache has a copy. The directory responds with a
Data message that also contains the number of acks to expect (in this case zero).

State  | Read           | Write           | Replacement    | INV            | Forwarded_GetS             | Forwarded_GetM | PutM_Ack | Data                 | Last ACK
I      | send GetS/IS^D | send GetM/IM^AD | error          | send Ack/I     | error                      | error          | error    | error                | error
S      | do Read        | send GetM/IM^AD | I              | send Ack/I     | error                      | error          | error    | error                | error
M      | do Read        | do Write        | send PutM/MI^A | error          | send Data, send PutMS/MS^A | send Data/I    | error    | error                | error
IS^D   | z              | z               | z              | send Ack/ISI^D | error                      | error          | error    | save Data, do Read/S | error
ISI^D  | z              | z               | z              | send Ack       | error                      | error          | error    | save Data, do Read/I | error
IM^AD  | z              | z               | z              | send Ack       | error                      | error          | error    | save Data/IM^A       | error
IM^A   | z              | z               | z              | error          | IMS^A                      | IMI^A          | error    | error                | do Write/M
IMI^A  | z              | z               | z              | error          | error                      | error          | error    | error                | do Write, send Data/I
IMS^A  | z              | z               | z              | send Ack/IMI^A | z                          | z              | error    | error                | do Write, send Data/S
MS^A   | do Read        | z               | z              | error          | send Data                  | send Data/MI^A | /S       | error                | error
MI^A   | z              | z               | z              | error          | send Data                  | send Data/I    | /I       | error                | error

Figure 4.43 Broadcast snooping cache controller transitions.

State | GetS                               | GetM                                                           | PutM (owner)                | PutMS (nonowner) | PutM (owner)                                | PutMS (nonowner)
DI    | send Data, add to sharers/DS       | send Data, clear sharers, set owner/DM                         | error                       | send PutM_Ack    | error                                       | send PutM_Ack
DS    | send Data, add to sharers/DS       | send INVs to sharers, clear sharers, set owner, send Data/DM   | error                       | send PutM_Ack    | error                                       | send PutM_Ack
DM    | forward GetS, add to sharers/DMS^D | forward GetM, send INVs to sharers, clear sharers, set owner   | save Data, send PutM_Ack/DI | send PutM_Ack    | save Data, add to sharers, send PutM_Ack/DS | send PutM_Ack
DMS^D | forward GetS, add to sharers       | forward GetM, send INVs to sharers, clear sharers, set owner/DM | save Data, send PutM_Ack/DS | send PutM_Ack    | save Data, add to sharers, send PutM_Ack/DS | send PutM_Ack

Figure 4.44 Directory controller transitions.

In this simplified protocol, the cache controller treats this single message as two
messages: a Data message, followed by a Last Ack event. The Data message is
processed first, saving the data and transitioning to IM^A. The Last Ack event is
then processed, transitioning to state M. Finally, the write can be performed in
state M.

If the GetM finds the directory in state DS (Directory Shared), the directory
will send Invalidate (INV) messages to all nodes on the sharers list, send Data to
the requester with the number of sharers, and transition to state DM. When the INV
messages arrive at the sharers, they will either find the block in state S or state I
(if they have silently invalidated the block). In either case, the sharer will send an
Ack directly to the requesting node. The requester will count the Acks it has
received and compare that to the number sent back with the Data message. When
all the Acks have arrived, the Last Ack event occurs, triggering the cache to
transition to state M and allowing the write to proceed. Note that all the Acks
may arrive before the Data message, but the Last Ack event cannot occur until
the Data message has been processed, because the Data message contains the ack
count. Thus the protocol assumes that the Data message is processed before the
Last Ack event.
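
The ack-counting rule just described can be sketched as a small state machine for one requested block. This is a hedged illustration, not the protocol's actual implementation: the class name `PendingWrite` and its method names are hypothetical, and only the IM^AD, IM^A, and M states of the requester are modeled.

```python
class PendingWrite:
    """Requester-side bookkeeping for one block after it has sent a GetM."""

    def __init__(self):
        self.state = "IM^AD"       # waiting for both Acks and Data
        self.acks_received = 0
        self.acks_expected = None  # unknown until the Data message arrives
        self.data = None

    def on_ack(self):
        # Acks may arrive before Data; they are only counted here.
        self.acks_received += 1
        self._check_last_ack()

    def on_data(self, data, ack_count):
        # The Data message carries the number of Acks to expect.
        self.data = data
        self.acks_expected = ack_count
        self.state = "IM^A"
        self._check_last_ack()

    def _check_last_ack(self):
        # The Last Ack event can fire only after Data has been processed.
        if self.state == "IM^A" and self.acks_received == self.acks_expected:
            self.state = "M"       # the write may now be performed


# Both Acks overtake the Data message.
early = PendingWrite()
early.on_ack()
early.on_ack()
state_before_data = early.state          # still "IM^AD"
early.on_data(data=0x80, ack_count=2)    # Data arrives; Last Ack fires
```

With an ack count of zero, as in the DI case above, `on_data` alone moves the block to M.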


[10/10/10/10/10/10] <4.4> Consider the advanced directory protocol described
above and the cache contents from Figure 4.20. What is the sequence of
transient states that the affected cache blocks move through in each of the
following cases?


a. [10] <4.4> P0: read 100
b. [10] <4.4> P0: read 120
c. [10] <4.4> P0: write 120 <-- 80
d. [10] <4.4> P15: write 120 <-- 80
e. [10] <4.4> P1: read 110
f. [10] <4.4> P0: write 108 <-- 48


[15/15/15/15/15/15/15] <4.4> Consider the advanced directory protocol
described above and the cache contents from Figure 4.42. What is the sequence
of transient states that the affected cache blocks move through in each of the
following cases? In all cases, assume that the processors issue their requests in the
same cycle, but the directory orders the requests in top-down order. Assume that
the controllers' actions appear to be atomic (e.g., the directory controller will
perform all the actions required for the DS --> DM transition before handling
another request for the same block).


a. [15] <4.4> P0: read 120
              P1: read 120
b. [15] <4.4> P0: read 120
              P1: write 120 <-- 80
c. [15] <4.4> P0: write 120
              P1: read 120
d. [15] <4.4> P0: write 120 <-- 80
              P1: write 120 <-- 90
e. [15] <4.4> P0: replace 110
              P1: read 110
f. [15] <4.4> P1: write 110 <-- 80
              P0: replace 110
g. [15] <4.4> P1: read 110
              P0: replace 110


[20/20/20/20/20] <4.4> For the multiprocessor illustrated in Figure 4.42 imple- 
menting the protocol described in Figure 4.43 and Figure 4.44, assume the follow- 
ing latencies: 


CPU read and write hits generate no stall cycles.

Completing a miss (i.e., do Read and do Write) takes L_ack cycles only if it
is performed in response to the Last Ack event (otherwise it gets done
while the data is copied to cache).

A CPU read or write that generates a replacement event issues the
corresponding GetShared or GetModified message before the PutModified
message (e.g., using a writeback buffer).

A cache controller event that sends a request or acknowledgment message
(e.g., GetShared) has latency L_send_msg cycles.

A cache controller event that reads the cache and sends a data message has
latency L_send_data cycles.

A cache controller event that receives a data message and updates the
cache has latency L_rcv_data cycles.

A memory controller incurs L_send_msg latency when it forwards a request
message.

A memory controller incurs an additional L_inv cycles for each invalidate
that it must send.

A cache controller incurs latency L_send_msg for each invalidate that it
receives (latency is until it sends the Ack message).

A memory controller has latency L_read_memory cycles to read memory and
send a data message.

A memory controller has latency L_write_memory cycles to write a data message
to memory (latency is until it sends the Ack message).

A nondata message (e.g., request, invalidate, Ack) has network latency
L_req_msg cycles.

A data message has network latency L_data_msg cycles.


Consider an implementation with the performance characteristics summarized in 
Figure 4.45. 

Implementation 1

Action          Latency
send_msg        6
send_data       20
rcv_data        15
read_memory     100
write_memory    20
inv
ack             4
req_msg         15
data_msg        30

Figure 4.45 Directory coherence latencies.
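
As one way to apply the cost rules above, the following sketch adds up the latency of a read miss whose data is supplied by memory with no invalidations or forwarding. It is an illustration under that assumption, not a worked answer to any particular part below; the names `LATENCY` and `read_miss_served_by_memory` are hypothetical, and the inv entry is left out because Figure 4.45 lists no value for it.

```python
# Latencies in cycles, copied from Figure 4.45 (inv omitted: no value listed).
LATENCY = {
    "send_msg": 6,
    "send_data": 20,
    "rcv_data": 15,
    "read_memory": 100,
    "write_memory": 20,
    "ack": 4,
    "req_msg": 15,
    "data_msg": 30,
}


def read_miss_served_by_memory(lat):
    """Latency seen by the requester when memory supplies the block."""
    return (
        lat["send_msg"]       # cache controller sends the GetShared request
        + lat["req_msg"]      # request crosses the network
        + lat["read_memory"]  # memory controller reads and sends the data
        + lat["data_msg"]     # data message crosses the network
        + lat["rcv_data"]     # cache receives the data; the read completes here
    )


total_cycles = read_miss_served_by_memory(LATENCY)  # 6 + 15 + 100 + 30 + 15 = 166
```

No L_ack term appears because the read is completed while the data is copied into the cache rather than in response to a Last Ack event.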


For the sequences of operations below, the cache contents of Figure 4.42, and the 
directory protocol above, what is the latency observed by each processor node? 


a. [20]<4.4>P0: read 100 
b. [20]<4.4>P0: read 128 

c. [20]<4.4>P0: write 128 <— 68 
d. [20]<4.4>P0: write 120 <-- 50 
e. [20]<4.4>P0: write 108 <-- 80 


[20] <4.4> In the case of a cache miss, both the switched snooping protocol 
described earlier and the directory protocol in this case study perform the read or 
write operation as soon as possible. In particular, they do the operation as part of 
the transition to the stable state, rather than transitioning to the stable state and 
simply retrying the operation. This is not an optimization. Rather, to ensure for- 
ward progress, protocol implementations must ensure that they perform at least 
one CPU operation before relinquishing a block. 


Suppose the coherence protocol implementation didn't do this. Explain how this 
might lead to livelock. Give a simple code example that could stimulate this 
behavior. 


[20/30] <4.4> Some directory protocols add an Owned (O) state to the protocol, 
similar to the optimization discussed for snooping protocols. The Owned state 
behaves like the Shared state, in that nodes may only read Owned blocks. But it 
behaves like the Modified state, in that nodes must supply data on other nodes’ Get 
requests to Owned blocks. The Owned state eliminates the case where a GetShared 
request to a block in state Modified requires the node to send the data both to the
requesting processor and to the memory. In a MOSI directory protocol, a
GetShared request to a block in either the Modified or Owned states supplies data to
the requesting node and transitions to the Owned state. A GetModified request in
state Owned is handled like a request in state Modified. This optimized MOSI
protocol only updates memory when a node replaces a block in state Modified or
Owned.

a. [20] <4.4> Explain why the MS^A state in the protocol is essentially a
"transient" Owned state.

b. [30] <4.4> Modify the cache and directory protocol tables to support a stable
Owned state.

[25/25] <4.4> The advanced directory protocol described above relies on a
point-to-point ordered interconnect to ensure correct operation. Assuming the initial
cache contents of Figure 4.42 and the following sequences of operations, explain
what problem could arise if the interconnect failed to maintain point-to-point
ordering. Assume that the processors perform the requests at the same time, but
they are processed by the directory in the order shown.


a. [25] <4.4> P1: read 110
              P15: write 110 <-- 90
b. [25] <4.4> P1: read 110
              P0: replace 110


Five
Memory Hierarchy Design


Ideally one would desire an indefinitely large memory capacity such
that any particular ... word would be immediately available. ... We
are ... forced to recognize the possibility of constructing a hierarchy of
memories, each of which has greater capacity than the preceding but
which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic
Computing Instrument (1946)


5.1 Introduction


Computer pioneers correctly predicted that programmers would want unlimited 
amounts of fast memory. An economical solution to that desire is a memory hier- 
archy, which takes advantage of locality and cost-performance of memory 
technologies. The principle of locality, presented in the first chapter, says that 
most programs do not access all code or data uniformly. Locality occurs in time 
(temporal locality) and in space (spatial locality). This principle, plus the guide- 
line that smaller hardware can be made faster, led to hierarchies based on memo- 
ries of different speeds and sizes. Figure 5.1 shows a multilevel memory 
hierarchy, including typical sizes and speeds of access. 

Since fast memory is expensive, a memory hierarchy is organized into several 
levels—each smaller, faster, and more expensive per byte than the next lower 
level. The goal is to provide a memory system with cost per byte almost as low as 
the cheapest level of memory and speed almost as fast as the fastest level. 

Note that each level maps addresses from a slower, larger memory to a 
smaller but faster memory higher in the hierarchy. As part of address mapping, 
the memory hierarchy is given the responsibility of address checking; hence, pro- 
tection schemes for scrutinizing addresses are also part of the memory hierarchy. 

The importance of the memory hierarchy has increased with advances in per- 
formance of processors. Figure 5.2 plots processor performance projections 
against the historical performance improvement in time to access main memory. 
Clearly, computer architects must try to close the processor-memory gap. 

The increasing size and thus importance of this gap led to the migration of the 
basics of memory hierarchy into undergraduate courses in computer architecture, 
and even to courses in operating systems and compilers. Thus, we'll start with a 
quick review of caches. The bulk of the chapter, however, describes more 
advanced innovations that address the processor-memory performance gap. 

When a word is not found in the cache, the word must be fetched from the
memory and placed in the cache before continuing. Multiple words, called a
block (or line), are moved for efficiency reasons. Each cache block includes a tag
to identify which memory address it corresponds to.

Figure 5.1 The levels in a typical memory hierarchy in embedded, desktop, and
server computers. As we move farther away from the processor, the memory in the
level below becomes slower and larger. Note that the time units change by factors of
10—from picoseconds to milliseconds—and that the size units change by factors of
1000—from bytes to terabytes.

Figure 5.2 Starting with 1980 performance as a baseline, the gap in performance
between memory and processors is plotted over time. Note that the vertical axis
must be on a logarithmic scale to record the size of the processor-DRAM performance
gap. The memory baseline is 64 KB DRAM in 1980, with a 1.07 per year performance
improvement in latency (see Figure 5.13 on page 313). The processor line assumes a
1.25 improvement per year until 1986, a 1.52 improvement until 2004, and a 1.20
improvement thereafter; see Figure 1.1 in Chapter 1.

A key design decision is where blocks (or lines) can be placed in a cache. The 
most popular scheme is set associative, where a set is a group of blocks in the 
cache. A block is first mapped onto a set, and then the block can be placed any- 
where within that set. Finding a block consists of first mapping the block address 
to the set, and then searching the set—usually in parallel—to find the block. The 
set is chosen by the address of the data: 


(Block address) MOD (Number of sets in cache) 


If there are n blocks in a set, the cache placement is called n-way set associative. 
The end points of set associativity have their own names. A direct-mapped cache 
has just one block per set (so a block is always placed in the same location), and a 
fully associative cache has just one set (so a block can be placed anywhere). 
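
The set-selection rule above can be tried out directly. The sketch below is a minimal illustration; the function names and the eight-block cache are assumptions chosen for the example, not values from the text.

```python
def number_of_sets(num_blocks, associativity):
    """Sets in a cache holding num_blocks blocks, with `associativity` blocks per set."""
    return num_blocks // associativity


def set_index(block_address, num_sets):
    """(Block address) MOD (Number of sets in cache)."""
    return block_address % num_sets


NUM_BLOCKS = 8       # hypothetical cache with eight block frames
BLOCK_ADDRESS = 12   # hypothetical block address to place

# Direct mapped: one block per set, so eight sets.
direct_mapped_set = set_index(BLOCK_ADDRESS, number_of_sets(NUM_BLOCKS, 1))   # 12 mod 8 = 4
# Two-way set associative: four sets of two blocks each.
two_way_set = set_index(BLOCK_ADDRESS, number_of_sets(NUM_BLOCKS, 2))         # 12 mod 4 = 0
# Fully associative: a single set containing all eight blocks.
fully_assoc_set = set_index(BLOCK_ADDRESS, number_of_sets(NUM_BLOCKS, NUM_BLOCKS))  # 12 mod 1 = 0
```

In the direct-mapped case the block has exactly one possible location; in the other two cases the index only narrows the choice to a set, and the block may occupy any frame within it.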
Caching data that is only read is easy, since the copy in the cache and mem- 
ory will be identical. Caching writes is more difficult: how can the copy in the 
cache and memory be kept consistent? There are two main strategies. A write- 
through cache updates the item in the cache and writes through to update main 
memory. A write-back cache only updates the copy in the cache. When the block 
is about to be replaced, it is copied back to memory. Both write strategies can use 
a write buffer to allow the cache to proceed as soon as the data is placed in the 
buffer rather than wait the full latency to write the data into memory. 

One measure of the benefits of different cache organizations is miss rate. Miss
rate is simply the fraction of cache accesses that result in a miss—that is, the 
number of accesses that miss divided by the number of accesses. 

To gain insights into the causes of high miss rates, which can inspire better 
cache designs, the three Cs model sorts all misses into three simple categories: 


■ Compulsory—The very first access to a block cannot be in the cache, so the
  block must be brought into the cache. Compulsory misses are those that occur
  even if you had an infinite cache.

■ Capacity—If the cache cannot contain all the blocks needed during execution
  of a program, capacity misses (in addition to compulsory misses) will occur
  because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is not fully associative, conflict
  misses (in addition to compulsory and capacity misses) will occur because a
  block may be discarded and later retrieved if conflicting blocks map to its set.


Figures C.8 and C.9 on pages C-23 and C-24 show the relative frequency of 
cache misses broken down by the "three Cs." (Chapter 4 adds a fourth C, for 
Coherency misses due to cache flushes to keep multiple caches coherent in a mul- 
tiprocessor; we won't consider those here.) 

Alas, miss rate can be a misleading measure for several reasons. Hence, some 
designers prefer measuring misses per instruction rather than misses per memory 
reference (miss rate). These two are related: 


$$\frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction count}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}$$

(It is often reported as misses per 1000 instructions to use integers instead of frac- 
tions.) For speculative processors, we only count instructions that commit. 
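
A short numeric sketch of the relation between the two measures follows. The 2% miss rate and 1.5 memory accesses per instruction are made-up inputs for illustration, and the function name is hypothetical.

```python
def misses_per_instruction(miss_rate, accesses_per_instruction):
    """Misses per instruction = miss rate x memory accesses per instruction."""
    return miss_rate * accesses_per_instruction


# Hypothetical workload: 2% of accesses miss, 1.5 memory accesses per instruction.
per_instruction = misses_per_instruction(0.02, 1.5)

# Reported per 1000 instructions, as the text suggests, to avoid small fractions.
per_thousand = 1000 * per_instruction
```

With these inputs the result is about 0.03 misses per instruction, or about 30 misses per 1000 instructions.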

The problem with both measures is that they don't factor in the cost of a miss. 
A better measure is the average memory access time: 


Average memory access time = Hit time + Miss rate × Miss penalty


where Hit time is the time to hit in the cache and Miss penalty is the time to 
replace the block from memory (that is, the cost of a miss). Average memory 
access time is still an indirect measure of performance; although it is a better 
measure than miss rate, it is not a substitute for execution time. For example, in 
Chapter 2 we saw that speculative processors may execute other instructions dur- 
ing a miss, thereby reducing the effective miss penalty. 
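
To see why miss rate alone can mislead, the sketch below compares two hypothetical caches using the formula above. All numbers are invented for illustration, and the function name is an assumption.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache A: fast hit, slightly higher miss rate (times in clock cycles).
amat_a = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Hypothetical cache B: lower miss rate, but a hit takes twice as long.
amat_b = average_memory_access_time(hit_time=2, miss_rate=0.045, miss_penalty=100)
```

Cache B has the better miss rate, yet under these assumptions its average memory access time (about 6.5 cycles) is worse than cache A's (about 6 cycles), because every access pays the longer hit time.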

If this material is new to you, or if this quick review moves too quickly, see 
Appendix C. It covers the same introductory material in more depth and includes 
examples of caches from real computers and quantitative evaluations of their 
effectiveness. 

Section C.3 in Appendix C also presents six basic cache optimizations, which 
we quickly review here. The appendix also gives quantitative examples of the bene- 
fits of these optimizations. 

1. Larger block size to reduce miss rate—The simplest way to reduce the miss
rate is to take advantage of spatial locality and increase the block size. Note 
that larger blocks also reduce compulsory misses, but they also increase the 
miss penalty. 

2. Bigger caches to reduce miss rate—The obvious way to reduce capacity 
misses is to increase cache capacity. Drawbacks include potentially longer hit 
time of the larger cache memory and higher cost and power. 


3. Higher associativity to reduce miss rate—Obviously, increasing associativity 
reduces conflict misses. Greater associativity can come at the cost of 
increased hit time. 


4. Multilevel caches to reduce miss penalty—A difficult decision is whether to 
make the cache hit time fast, to keep pace with the increasing clock rate of 
processors, or to make the cache large, to overcome the widening gap 
between the processor and main memory. Adding another level of cache 
between the original cache and memory simplifies the decision (see Figure 
5.3). The first-level cache can be small enough to match a fast clock cycle 
time, yet the second-level cache can be large enough to capture many 
accesses that would go to main memory. The focus on misses in second-level 
caches leads to larger blocks, bigger capacity, and higher associativity. If L1
and L2 refer, respectively, to first- and second-level caches, we can redefine
the average memory access time:


$$\text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2})$$


5. Giving priority to read misses over writes to reduce miss penalty—A write 
buffer is a good place to implement this optimization. Write buffers create 
hazards because they hold the updated value of a location needed on a read 
miss—that is, a read-after-write hazard through memory. One solution is to 
check the contents of the write buffer on a read miss. If there are no conflicts, 
and if the memory system is available, sending the read before the writes 
reduces the miss penalty. Most processors give reads priority over writes. 


6. Avoiding address translation during indexing of the cache to reduce hit 
time—Caches must cope with the translation of a virtual address from the 
processor to a physical address to access memory. (Virtual memory is cov- 
ered in Sections 5.4 and C.4.) Figure 5.3 shows a typical relationship between 
caches, translation lookaside buffers (TLBs), and virtual memory. A common 
optimization is to use the page offset—the part that is identical in both virtual 
and physical addresses—to index the cache. The virtual part of the address is 
translated while the cache is read using that index, so the tag match can use 
physical addresses. This scheme allows the cache read to begin immediately, 
and yet the tag comparison still uses physical addresses. The drawback of this 
virtually indexed, physically tagged optimization is that the size of the page 
limits the size of the cache. For example, a direct-mapped cache can be no 
bigger than the page size. Higher associativity can keep the cache index in the 
physical part of the address and yet still support a cache larger than a page. 


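[Figure: address breakdown for the hierarchy. Virtual address <64> is split into
Virtual page number <51> and Page offset <13>. The virtual page number supplies
the TLB tag compare address <43> and TLB index <8>; the TLB holds TLB tag <43>
and TLB data <27>. The page offset supplies the L1 cache index <7> and Block
offset <6>; the L1 cache holds L1 cache tag <43> and L1 data <512>, with L1 tag
compare address <27>, and sends data to the CPU. The physical address supplies
the L2 tag compare address <19>, L2 cache index <16>, and Block offset <6>; the
L2 cache holds L2 cache tag <19> and L2 data <512>, and sends data to the L1
cache or CPU.]

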
Figure 5.3 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache 
access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and 
the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address 
is 40 bits. The primary difference between this figure and a real memory hierarchy, as in Figure 5.18 on page 327, is 
higher associativity for caches and TLBs and a smaller virtual address than 64 bits. 


For example, doubling associativity while doubling the cache size maintains 
the size of the index, since it is controlled by this formula: 


$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$


A seemingly obvious alternative is to just use virtual addresses to access the 
cache, but this can cause extra overhead in the operating system. 
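
The index formula can be checked against the parameters stated for Figure 5.3 (8 KB page, 64-byte blocks, an 8 KB direct-mapped L1, a 4 MB direct-mapped L2). The sketch below is illustrative; the function names are hypothetical, and sizes are assumed to be powers of two.

```python
def index_bits(cache_size, block_size, associativity):
    """Number of index bits: 2**index = cache size / (block size x associativity)."""
    num_sets = cache_size // (block_size * associativity)
    return num_sets.bit_length() - 1  # log2 for a power of two


def offset_bits(size_in_bytes):
    """Bits needed to address a byte within a power-of-two-sized region."""
    return size_in_bytes.bit_length() - 1


KB = 1024
MB = 1024 * KB

l1_index = index_bits(8 * KB, 64, 1)           # 128 sets, so 7 bits
l2_index = index_bits(4 * MB, 64, 1)           # 65,536 sets, so 16 bits
block_offset = offset_bits(64)                 # 6 bits
page_offset = offset_bits(8 * KB)              # 13 bits

# Virtually indexed, physically tagged: index + block offset must fit in the page offset.
l1_fits_in_page_offset = l1_index + block_offset <= page_offset

# Doubling both the size and the associativity leaves the index unchanged.
l1_doubled_index = index_bits(16 * KB, 64, 2)  # still 7 bits
```

The 8 KB direct-mapped L1 uses exactly the 13 page-offset bits (7 index plus 6 block offset), which is why a larger virtually indexed cache needs higher associativity rather than more index bits.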


Note that each of these six optimizations above has a potential disadvantage 
that can lead to increased, rather than decreased, average memory access time. 
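
The two-level form of average memory access time given under the fourth optimization can be evaluated in the same way. The inputs below are invented for illustration, and the function name is an assumption; note that the L2 miss rate used here is the local one, measured against accesses that reach L2.

```python
def two_level_amat(hit_time_l1, miss_rate_l1, hit_time_l2, miss_rate_l2, miss_penalty_l2):
    """Hit time L1 + miss rate L1 x (hit time L2 + miss rate L2 x miss penalty L2)."""
    l1_miss_penalty = hit_time_l2 + miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * l1_miss_penalty


# Hypothetical hierarchy (times in clock cycles):
# L1 hits in 1 cycle and misses 4% of the time;
# L2 hits in 10 cycles and misses half of the accesses that reach it;
# main memory costs 200 cycles.
amat_with_l2 = two_level_amat(1, 0.04, 10, 0.5, 200)

# The same L1 backed directly by memory, for comparison.
amat_without_l2 = 1 + 0.04 * 200
```

Under these assumptions the second-level cache lowers the average from about 9 cycles to about 5.4 cycles, even though half of the accesses that reach it still miss.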


The rest of this chapter assumes familiarity with the material above, including
Figure 5.3. To put cache ideas into practice, throughout this chapter (and
Appendix C) we show examples from the memory hierarchy of the AMD Opteron
microprocessor. Toward the end of the chapter, we evaluate the impact of this
hierarchy on performance using the SPEC2000 benchmark programs.

The Opteron is a microprocessor designed for desktops and servers. Even
these two related classes of computers have different concerns in a memory
hierarchy. Desktop computers are primarily running one application at a time on top
of an operating system for a single user, whereas server computers may have
hundreds of users running potentially dozens of applications simultaneously.
These characteristics result in more context switches, which effectively increase
miss rates. Thus, desktop computers are concerned more with average latency
from the memory hierarchy, whereas server computers are also concerned about
memory bandwidth.


5.2 Eleven Advanced Optimizations of Cache Performance


The average memory access time formula above gives us three metrics for cache 
optimizations: hit time, miss rate, and miss penalty. Given the popularity of super- 
scalar processors, we add cache bandwidth to this list. Hence, we group 11 
advanced cache optimizations into the following categories: 


• Reducing the hit time: small and simple caches, way prediction, and trace
  caches

• Increasing cache bandwidth: pipelined caches, multibanked caches, and
  nonblocking caches

• Reducing the miss penalty: critical word first and merging write buffers

• Reducing the miss rate: compiler optimizations

• Reducing the miss penalty or miss rate via parallelism: hardware prefetching
  and compiler prefetching


We will conclude with a summary of the implementation complexity and the per- 
formance benefits of the 11 techniques presented (Figure 5.11 on page 309). 


First Optimization: Small and Simple Caches to Reduce Hit Time 


A time-consuming portion of a cache hit is using the index portion of the address 
to read the tag memory and then compare it to the address. Smaller hardware can 
be faster, so a small cache can help the hit time. It is also critical to keep an L2 
cache small enough to fit on the same chip as the processor to avoid the time pen- 
alty of going off chip. 

The second suggestion is to keep the cache simple, such as using direct
mapping. One benefit of direct-mapped caches is that the designer can overlap the tag
check with the transmission of the data. This effectively reduces hit time.
Hence, the pressure of a fast clock cycle encourages small and simple cache
designs for first-level caches. For lower-level caches, some designs strike a
compromise by keeping the tags on chip and the data off chip, promising a fast tag
check, yet providing the greater capacity of separate memory chips.

Although the amount of on-chip cache increased with new generations of
microprocessors, the size of the L1 caches has recently not increased between
generations. The L1 caches are the same size for three generations of AMD
microprocessors: K6, Athlon, and Opteron. The emphasis is on fast clock rate
while hiding L1 misses with dynamic execution and using L2 caches to avoid
going to memory.

One approach to determining the impact on hit time in advance of building a 
chip is to use CAD tools. CACTI is a program to estimate the access time of 
alternative cache structures on CMOS microprocessors within 10% of more 
detailed CAD tools. For a given minimum feature size, it estimates the hit time of 
caches as you vary cache size, associativity, and number of read/write ports. Fig- 
ure 5.4 shows the estimated impact on hit time as cache size and associativity are 
varied. Depending on cache size, for these parameters the model suggests that hit 
time for direct mapped is 1.2-1.5 times faster than two-way set associative; two- 
way is 1.02-1.11 times faster than four-way; and four-way is 1.0-1.08 times 
faster than fully associative.

Figure 5.4 Access times as size and associativity vary in a CMOS cache. These data
are based on the CACTI model 4.0 by Tarjan, Thoziyoor, and Jouppi [2006]. They
assumed 90 nm feature size, a single bank, and 64-byte blocks. The median ratios of
access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way,
4-way, and 8-way associative caches, respectively.


Example

Assume that the hit time of a two-way set-associative first-level data cache is 1.1
times faster than a four-way set-associative cache of the same size. The miss rate
falls from 0.049 to 0.044 for an 8 KB data cache, according to Figure C.8 in
Appendix C. Assume a hit is 1 clock cycle and that the cache is the critical path
for the clock. Assume the miss penalty is 10 clock cycles to the L2 cache for the
two-way set-associative cache, and that the L2 cache does not miss. Which has
the faster average memory access time?


Answer

For the two-way cache:


$$\text{Average memory access time}_{\text{2-way}} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty} = 1 + 0.049 \times 10 = 1.49$$


For the four-way cache, the clock time is 1.1 times longer. The elapsed time of 
the miss penalty should be the same since it's not affected by the processor clock 
rate, so assume it takes 9 of the longer clock cycles: 


$$\text{Average memory access time}_{\text{4-way}} = \text{Hit time} \times 1.1 + \text{Miss rate} \times \text{Miss penalty} = 1.1 + 0.044 \times 9 = 1.50$$


If it really stretched the clock cycle time by a factor of 1.1, the performance 
impact would be even worse than indicated by the average memory access time, 
as the clock would be slower even when the processor is not accessing the cache. 
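
The comparison in this example can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function name `amat` is an assumption, and the inputs are the example's own figures (hit times of 1 and 1.1, miss rates of 0.049 and 0.044, miss penalties of 10 and 9).

```c
/* Average memory access time = hit time + miss rate * miss penalty.
   All arguments use the same time unit as the worked example. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

Calling `amat(1.0, 0.049, 10.0)` gives about 1.49 for the two-way cache, while `amat(1.1, 0.044, 9.0)` gives about 1.50 for the four-way cache, so the two-way cache is marginally faster under these assumptions.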


Despite this advantage, since many processors take at least 2 clock cycles to
access the cache, L1 caches today are often at least two-way associative.


Second Optimization: Way Prediction to Reduce Hit Time 


Another approach reduces conflict misses and yet maintains the hit speed of 
direct-mapped cache. In way prediction, extra bits are kept in the cache to predict 
the way, or block within the set of the next cache access. This prediction means 
the multiplexor is set early to select the desired block, and only a single tag com- 
parison is performed that clock cycle in parallel with reading the cache data. A 
miss results in checking the other blocks for matches in the next clock cycle. 

Added to each block of a cache are block predictor bits. The bits select which 
of the blocks to try on the next cache access. If the predictor is correct, the cache 
access latency is the fast hit time. If not, it tries the other block, changes the way 
predictor, and has a latency of one extra clock cycle. Simulations suggested set 
prediction accuracy is in excess of 85% for a two-way set, so way prediction 
saves pipeline stages more than 85% of the time. Way prediction is a good match 
to speculative processors, since they must already undo actions when speculation 
is unsuccessful. The Pentium 4 uses way prediction. 
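
To make the mechanism concrete, here is a minimal C model of one two-way set with a way predictor. It is a simplification under stated assumptions: the text keeps predictor bits with each block, whereas this sketch keeps a single predicted way per set, and the names `Set` and `way_predicted_lookup` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* One two-way set with a way-predictor field (illustrative layout). */
typedef struct {
    uint32_t tag[2];
    bool     valid[2];
    int      predicted_way;   /* way to try first on the next access */
} Set;

/* Returns the hit latency in clock cycles: 1 if the predicted way hits,
   2 if the other way hits (the predictor is then retrained), 0 on a miss. */
int way_predicted_lookup(Set *set, uint32_t tag)
{
    int first = set->predicted_way;
    int other = 1 - first;

    if (set->valid[first] && set->tag[first] == tag)
        return 1;                       /* fast hit: one tag comparison */
    if (set->valid[other] && set->tag[other] == tag) {
        set->predicted_way = other;     /* change the way predictor */
        return 2;                       /* one extra clock cycle */
    }
    return 0;                           /* miss in both ways */
}
```

With a prediction accuracy of 85%, a model like this would average 0.85 × 1 + 0.15 × 2 = 1.15 cycles per hit.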


Third Optimization: Trace Caches to Reduce Hit Time


A challenge in the effort to find lots of instruction-level parallelism is to find 
enough instructions every cycle without use dependencies. To address this chal- 
lenge, blocks in a trace cache contain dynamic traces of the executed instructions 
rather than static sequences of instructions as determined by layout in memory. 
Hence, the branch prediction is folded into the cache and must be validated along 
with the addresses to have a valid fetch. 

Clearly, trace caches have much more complicated address-mapping mecha- 
nisms, as the addresses are no longer aligned to power-of-two multiples of the 
word size. However, they can better utilize long blocks in the instruction cache. 
Long blocks in conventional caches may be entered in the middle from a branch 
and exited before the end by a branch, so they can have poor space utilization. 
The downside of trace caches is that conditional branches making different 
choices result in the same instructions being part of separate traces, which each 
occupy space in the trace cache and lower its space efficiency. 

Note that the trace cache of the Pentium 4 uses decoded micro-operations, 
which acts as another performance optimization since it saves decode time. 

Many optimizations are simple to understand and are widely used, but a trace 
cache is neither simple nor popular. It is relatively expensive in area, power, and 
complexity compared to its benefits, so we believe trace caches are likely a one- 
time innovation. We include them because they appear in the popular Pentium 4. 


Fourth Optimization: Pipelined Cache Access to Increase 
Cache Bandwidth 


This optimization is simply to pipeline cache access so that the effective latency 
of a first-level cache hit can be multiple clock cycles, giving fast clock cycle time 
and high bandwidth but slow hits. For example, the pipeline for the Pentium took 
1 clock cycle to access the instruction cache, for the Pentium Pro through Pen- 
tium III it took 2 clocks, and for the Pentium 4 it takes 4 clocks. This split 
increases the number of pipeline stages, leading to greater penalty on mispre- 
dicted branches and more clock cycles between the issue of the load and the use 
of the data (see Chapter 2). 


Fifth Optimization: Nonblocking Caches to Increase Cache 
Bandwidth 


For pipelined computers that allow out-of-order completion (Chapter 2), the pro- 
cessor need not stall on a data cache miss. For example, the processor could con- 
tinue fetching instructions from the instruction cache while waiting for the data 
cache to return the missing data. A nonblocking cache or lockup-free cache
escalates the potential benefits of such a scheme by allowing the data cache to
continue to supply cache hits during a miss. This "hit under miss" optimization
reduces the effective miss penalty by being helpful during a miss instead of
ignoring the requests of the processor. A subtle and complex option is that the
cache may further lower the effective miss penalty if it can overlap multiple
misses: a "hit under multiple miss" or "miss under miss" optimization. The
second option is beneficial only if the memory system can service multiple misses.

Figure 5.5 Ratio of the average memory stall time for a blocking cache to
hit-under-miss schemes as the number of outstanding misses is varied for 18 SPEC92
programs. The hit-under-64-misses line allows one miss for every register in the processor.
The first 14 programs are floating-point programs: the average for hit under 1 miss is
76%, for 2 misses is 51%, and for 64 misses is 39%. The final four are integer programs,
and the three averages are 81%, 78%, and 78%, respectively. These data were collected
for an 8 KB direct-mapped data cache with 32-byte blocks and a 16-clock-cycle miss
penalty, which today would imply a second-level cache. These data were generated
using the VLIW Multiflow compiler, which scheduled loads away from use [Farkas and
Jouppi 1994]. Although it may be a good model for L1 misses to L2 caches, it would be
interesting to redo this experiment with SPEC2006 benchmarks and modern
assumptions on miss penalty.

Figure 5.5 shows the average time in clock cycles for cache misses for an 
8 KB data cache as the number of outstanding misses is varied. Floating-point 
programs benefit from increasing complexity, while integer programs get 
almost all of the benefit from a simple hit-under-one-miss scheme. As pointed 
out in Chapter 3, the number of simultaneous outstanding misses limits achiev- 
able instruction-level parallelism in programs. 


Example

Which is more important for floating-point programs: two-way set associativity
or hit under one miss? What about integer programs? Assume the following
average miss rates for 8 KB data caches: 11.4% for floating-point programs with a
direct-mapped cache, 10.7% for these programs with a two-way set-associative
cache, 7.4% for integer programs with a direct-mapped cache, and 6.0% for
integer programs with a two-way set-associative cache. Assume the average memory
stall time is just the product of the miss rate and the miss penalty, and that the
cache is the one described in Figure 5.5, which we assume has an L2 cache.


Answer

The numbers for Figure 5.5 were based on a miss penalty of 16 clock cycles
assuming an L2 cache. Although the programs are older and this is low for a miss 
penalty, let's stick with it for consistency. (To see how well it would work on 
modern programs and miss penalties, we'd need to redo this experiment.) For 
floating-point programs, the average memory stall times are 


$$\text{Miss rate}_{\text{DM}} \times \text{Miss penalty} = 11.4\% \times 16 = 1.82$$

$$\text{Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 10.7\% \times 16 = 1.71$$


The memory stalls for two-way are thus 1.71/1.82 or 94% of direct-mapped
cache. The caption of Figure 5.5 says hit under one miss reduces the average
memory stall time to 76% of a blocking cache. Hence, for floating-point
programs, the direct-mapped data cache supporting hit under one miss gives better
performance than a two-way set-associative cache that blocks on a miss.

For integer programs the calculation is 


$$\text{Miss rate}_{\text{DM}} \times \text{Miss penalty} = 7.4\% \times 16 = 1.18$$

$$\text{Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 6.0\% \times 16 = 0.96$$


The memory stalls of two-way are thus 0.96/1.18 or 81% of direct-mapped 
cache. The caption of Figure 5.5 says hit under one miss reduces the average 
memory stall time to 81% of a blocking cache, so the two options give about the 
same performance for integer programs using this data. 
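
The stall-time comparison above reduces to two small formulas, sketched here in C. The function names are assumptions; the miss rates and the 16-cycle miss penalty are the ones given in the example.

```c
/* Average memory stall time per access: miss rate times miss penalty. */
double stall_time(double miss_rate, double miss_penalty)
{
    return miss_rate * miss_penalty;
}

/* Stall time of a two-way cache relative to a direct-mapped cache that has
   the same miss penalty. The penalty cancels, leaving the miss-rate ratio. */
double two_way_ratio(double miss_rate_dm, double miss_rate_2way,
                     double miss_penalty)
{
    return stall_time(miss_rate_2way, miss_penalty) /
           stall_time(miss_rate_dm, miss_penalty);
}
```

Comparing `two_way_ratio(0.114, 0.107, 16)` against the 76% in the caption of Figure 5.5, and `two_way_ratio(0.074, 0.060, 16)` against the 81%, reproduces the conclusions for floating-point and integer programs.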


The real difficulty with performance evaluation of nonblocking caches is that 
a cache miss does not necessarily stall the processor. In this case, it is difficult to 
judge the impact of any single miss, and hence difficult to calculate the average 
memory access time. The effective miss penalty is not the sum of the misses but 
the nonoverlapped time that the processor is stalled. The benefit of nonblocking 
caches is complex, as it depends upon the miss penalty when there are multiple 
misses, the memory reference pattern, and how many instructions the processor 
can execute with a miss outstanding. 

In general, out-of-order processors are capable of hiding much of the miss
penalty of an L1 data cache miss that hits in the L2 cache, but are not capable of
hiding a significant fraction of an L2 cache miss.


Sixth Optimization: Multibanked Caches to Increase Cache 
Bandwidth 


Rather than treat the cache as a single monolithic block, we can divide it into
independent banks that can support simultaneous accesses. Banks were
originally used to improve performance of main memory and are now used inside
modern DRAM chips as well as with caches. The L2 cache of the AMD
Opteron has two banks, and the L2 cache of the Sun Niagara has four banks.

Figure 5.6 Four-way interleaved cache banks using block addressing. Assuming 64
bytes per block, each of these addresses would be multiplied by 64 to get byte
addressing.

Clearly, banking works best when the accesses naturally spread themselves 
across the banks, so the mapping of addresses to banks affects the behavior of 
the memory system. A simple mapping that works well is to spread the 
addresses of the block sequentially across the banks, called sequential inter- 
leaving. For example, if there are four banks, bank 0 has all blocks whose 
address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; and 
so on. Figure 5.6 shows this interleaving. 
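
Sequential interleaving amounts to a modulo operation on the block address. The sketch below shows that mapping in C; the function names are illustrative assumptions, and the byte-address variant simply divides by the block size first.

```c
#include <stdint.h>

/* Bank selected by sequential interleaving of block addresses. */
unsigned bank_of_block(uint64_t block_addr, unsigned num_banks)
{
    return (unsigned)(block_addr % num_banks);
}

/* Same mapping starting from a byte address. */
unsigned bank_of_byte(uint64_t byte_addr, unsigned block_size,
                      unsigned num_banks)
{
    return bank_of_block(byte_addr / block_size, num_banks);
}
```

With four banks and 64-byte blocks, block address 5 lands in bank 1, and byte address 384 (block 6) lands in bank 2.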


Seventh Optimization: Critical Word First and Early Restart to 
Reduce Miss Penalty 


This technique is based on the observation that the processor normally needs just 
one word of the block at a time. This strategy is impatience: Don't wait for the full 
block to be loaded before sending the requested word and restarting the processor. 
Here are two specific strategies: 


• Critical word first: Request the missed word first from memory and send it
  to the processor as soon as it arrives; let the processor continue execution
  while filling the rest of the words in the block.

• Early restart: Fetch the words in normal order, but as soon as the requested
  word of the block arrives, send it to the processor and let the processor
  continue execution.


Generally, these techniques benefit only designs with large cache blocks, since
with small blocks the rest of the block arrives soon after the requested word.
Note that caches normally continue to satisfy accesses to other blocks while the
rest of the block is being filled.

Alas, given spatial locality, there is a good chance that the next reference is 
to the rest of the block. Just as with nonblocking caches, the miss penalty is not 
simple to calculate. When there is a second request in critical word first, the 
effective miss penalty is the nonoverlapped time from the reference until the 
second piece arrives. 
Example

Let's assume a computer has a 64-byte cache block, an L2 cache that takes 7
clock cycles to get the critical 8 bytes, and then 1 clock cycle per 8 bytes + 1
extra clock cycle to fetch the rest of the block. (These parameters are similar to
the AMD Opteron.) Without critical word first, it's 8 clock cycles for the first 8
bytes and then 1 clock per 8 bytes for the rest of the block. Calculate the average
miss penalty for critical word first, assuming that there will be no other accesses
to the rest of the block until it is completely fetched. Then calculate assuming the
following instructions read data 8 bytes at a time from the rest of the block.
Compare the times with and without critical word first.


Answer

The average miss penalty is 7 clock cycles for critical word first, and without
critical word first it takes 8 + (8 - 1) x 1 or 15 clock cycles for the processor to read
a full cache block. Thus, for one word, the answer is 15 versus 7 clock cycles.
The Opteron issues two loads per clock cycle, so it takes 8/2 or 4 clocks to issue
the loads. Without critical word first, it would take 19 clock cycles to load and
read the full block. With critical word first, it's 7 + 7 x 1 + 1 or 15 clock cycles to
read the whole block, since the loads are overlapped in critical word first. For the
full block, the answer is 19 versus 15 clock cycles.


As this example illustrates, the benefits of critical word first and early restart 
depend on the size of the block and the likelihood of another access to the portion 
of the block that has not yet been fetched. 
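
The four cycle counts in this example follow directly from its parameters, as the C sketch below shows. The constant and function names are assumptions chosen for readability; the values are those stated in the example.

```c
enum {
    BLOCK_BYTES     = 64,
    WORD_BYTES      = 8,
    CWF_FIRST       = 7,  /* cycles to the critical 8 bytes */
    PLAIN_FIRST     = 8,  /* cycles to the first 8 bytes without critical word first */
    PER_WORD        = 1,  /* cycles for each further 8 bytes */
    CWF_EXTRA       = 1,  /* extra cycle to fetch the rest with critical word first */
    LOADS_PER_CLOCK = 2   /* loads the processor issues per clock */
};

int words_per_block(void) { return BLOCK_BYTES / WORD_BYTES; }

/* Penalty when only the missed word is needed, with critical word first. */
int one_word_cwf(void) { return CWF_FIRST; }

/* Cycles to fetch the whole block without critical word first. */
int block_fill_plain(void)
{
    return PLAIN_FIRST + (words_per_block() - 1) * PER_WORD;
}

/* Cycles to read the whole block with critical word first (loads overlap). */
int block_read_cwf(void)
{
    return CWF_FIRST + (words_per_block() - 1) * PER_WORD + CWF_EXTRA;
}

/* Cycles to fetch and then read the whole block without critical word first. */
int block_read_plain(void)
{
    return block_fill_plain() + words_per_block() / LOADS_PER_CLOCK;
}
```

These functions return 7, 15, 15, and 19 respectively, matching the one-word comparison (15 versus 7) and the full-block comparison (19 versus 15).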


Eighth Optimization: Merging Write Buffer to Reduce 
Miss Penalty 


Write-through caches rely on write buffers, as all stores must be sent to the next 
lower level of the hierarchy. Even write-back caches use a simple buffer when a 
block is replaced. If the write buffer is empty, the data and the full address are 
written in the buffer, and the write is finished from the processor's perspective; 
the processor continues working while the write buffer prepares to write the word 
to memory. If the buffer contains other modified blocks, the addresses can be 
checked to see if the address of this new data matches the address of a valid write 
buffer entry. If so, the new data are combined with that entry. Write merging is 
the name of this optimization. The Sun Niagara processor, among many others, 
uses write merging. 

If the buffer is full and there is no address match, the cache (and processor) 
must wait until the buffer has an empty entry. This optimization uses the memory 
more efficiently since multiword writes are usually faster than writes performed 
one word at a time. Skadron and Clark [1997] found that about 5% to 10% of per- 
formance was lost due to stalls in a four-entry write buffer. 

The optimization also reduces stalls due to the write buffer being full. Figure
5.7 shows a write buffer with and without write merging. Assume we had four
entries in the write buffer, and each entry could hold four 64-bit words. Without
this optimization, four stores to sequential addresses would fill the buffer at one
word per entry, even though these four words when merged exactly fit within a
single entry of the write buffer.

Figure 5.7 To illustrate write merging, the write buffer on top does not use it while
the write buffer on the bottom does. The four writes are merged into a single buffer
entry with write merging; without it, the buffer is full even though three-fourths of each
entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. The
address for each entry is on the left, with a valid bit (V) indicating whether the next
sequential 8 bytes in this entry are occupied. (Without write merging, the words to the
right in the upper part of the figure would only be used for instructions that wrote
multiple words at the same time.)
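
The behavior described above can be modeled with a small C sketch of a four-entry buffer whose entries hold four 64-bit words. This is an illustrative model only: the type and function names are hypothetical, draining to memory is not modeled, and with merging enabled each entry is assumed to cover an aligned 32-byte region.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES         4
#define WORDS_PER_ENTRY 4
#define WORD_BYTES      8

typedef struct {
    bool     in_use;
    uint64_t base;                      /* address of the entry's first word */
    bool     valid[WORDS_PER_ENTRY];    /* one valid bit per 8-byte word */
    uint64_t data[WORDS_PER_ENTRY];
} Entry;

typedef struct { Entry e[ENTRIES]; } WriteBuffer;

void wb_init(WriteBuffer *wb) { memset(wb, 0, sizeof *wb); }

/* Buffers one 64-bit store. Returns false if the buffer is full, which is
   the case where the cache and processor would have to wait. */
bool wb_store(WriteBuffer *wb, uint64_t addr, uint64_t value, bool merge)
{
    uint64_t span = (uint64_t)WORDS_PER_ENTRY * WORD_BYTES;
    uint64_t base = merge ? addr - addr % span : addr;
    unsigned slot = merge ? (unsigned)((addr % span) / WORD_BYTES) : 0;

    if (merge) {                        /* look for a matching valid entry */
        for (int i = 0; i < ENTRIES; i++) {
            if (wb->e[i].in_use && wb->e[i].base == base) {
                wb->e[i].valid[slot] = true;
                wb->e[i].data[slot] = value;
                return true;
            }
        }
    }
    for (int i = 0; i < ENTRIES; i++) { /* otherwise take a free entry */
        if (!wb->e[i].in_use) {
            wb->e[i].in_use = true;
            wb->e[i].base = base;
            wb->e[i].valid[slot] = true;
            wb->e[i].data[slot] = value;
            return true;
        }
    }
    return false;
}

int wb_entries_used(const WriteBuffer *wb)
{
    int n = 0;
    for (int i = 0; i < ENTRIES; i++)
        n += wb->e[i].in_use;
    return n;
}
```

Four stores to consecutive 8-byte addresses starting at an aligned address occupy one entry with `merge` set and all four entries without it, so a fifth store is accepted only in the merging case.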

Note that input/output device registers are often mapped into the physical
address space. These I/O addresses cannot allow write merging because separate
I/O registers may not act like an array of words in memory. For example, they
may require one address and data word per register rather than multiword writes
using a single address.

In a write-back cache, the block that is replaced is sometimes called the vic- 
tim. Hence, the AMD Opteron calls its write buffer a victim buffer. The write vic- 
tim buffer or victim buffer contains the dirty blocks that are discarded from a 
cache because of a miss. Rather than stall on a subsequent cache miss, the con- 
tents of the buffer are checked on a miss to see if they have the desired data 
before going to the next lower-level memory. This name makes it sound like
another optimization called a victim cache. In contrast, the victim cache can 
include any blocks discarded from the cache on a miss, whether they are dirty or 
not [Jouppi 1990]. 

While the purpose of the write buffer is to allow the cache to proceed without 
waiting for dirty blocks to write to memory, the goal of a victim cache is to reduce 
the impact of conflict misses. Write buffers are far more popular today than victim 
caches, despite the confusion caused by the use of "victim" in their title. 


Ninth Optimization: Compiler Optimizations to Reduce
Miss Rate


Thus far, our techniques have required changing the hardware. This next tech- 
nique reduces miss rates without any hardware changes. 

This magical reduction comes from optimized software—the hardware 
designer's favorite solution! The increasing performance gap between processors 
and main memory has inspired compiler writers to scrutinize the memory hier- 
archy to see if compile time optimizations can improve performance. Once again, 
research is split between improvements in instruction misses and improvements 
in data misses. The optimizations presented below are found in many modern 
compilers. 


Code and Data Rearrangement 


Code can easily be rearranged without affecting correctness; for example, 
reordering the procedures of a program might reduce instruction miss rates by 
reducing conflict misses [McFarling 1989]. Another code optimization aims for 
better efficiency from long cache blocks. Aligning basic blocks so that the entry 
point is at the beginning of a cache block decreases the chance of a cache miss 
for sequential code. If the compiler knows that a branch is likely to be taken, it 
can improve spatial locality by changing the sense of the branch and swapping 
the basic block at the branch target with the basic block sequentially after the 
branch. Branch straightening is the name of this optimization. 

Data have even fewer restrictions on location than code. The goal of such 
transformations is to try to improve the spatial and temporal locality of the data. 
For example, array calculations—the cause of most misses in scientific codes— 
can be changed to operate on all data in a cache block rather than blindly striding 
through arrays in the order that the programmer wrote the loop. 

To give a feeling of this type of optimization, we will show two examples, 
transforming the C code by hand to reduce cache misses. 


Loop Interchange 


Some programs have nested loops that access data in memory in nonsequential 
order. Simply exchanging the nesting of the loops can make the code access the 
data in the order they are stored. Assuming the arrays do not fit in the cache, this 
technique reduces misses by improving spatial locality; reordering maximizes use 
of data in a cache block before they are discarded. 


    /* Before */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];


The original code would skip through memory in strides of 100 words, while the 
revised version accesses all the words in one cache block before going to the next 
block. This optimization improves cache performance without affecting the num- 
ber of instructions executed. 
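
To get a feel for the size of the effect, the following C sketch counts misses for both loop orders in a simulated direct-mapped cache. Everything about the cache is an assumption made for illustration (8 KB, 64-byte blocks, 4-byte elements, array starting at address 0); only the 5000-by-100 array shape comes from the code above.

```c
#include <stdbool.h>
#include <stdint.h>

#define ROWS        5000
#define COLS        100
#define ELEM_BYTES  4      /* assumption: 4-byte elements */
#define CACHE_BYTES 8192   /* assumption: 8 KB direct-mapped cache */
#define BLOCK_BYTES 64     /* assumption: 64-byte blocks */
#define NUM_SETS    (CACHE_BYTES / BLOCK_BYTES)

/* Counts misses when every x[i][j] is touched once. row_order = true is the
   "After" loop nest (i outer, j inner); false is the "Before" nest. */
long count_misses(bool row_order)
{
    uint64_t resident[NUM_SETS] = { 0 };  /* block number held by each set */
    bool     valid[NUM_SETS] = { false };
    long     misses = 0;
    int      outer_n = row_order ? ROWS : COLS;
    int      inner_n = row_order ? COLS : ROWS;

    for (int outer = 0; outer < outer_n; outer++) {
        for (int inner = 0; inner < inner_n; inner++) {
            int i = row_order ? outer : inner;
            int j = row_order ? inner : outer;
            uint64_t addr  = ((uint64_t)i * COLS + j) * ELEM_BYTES;
            uint64_t block = addr / BLOCK_BYTES;
            unsigned set   = (unsigned)(block % NUM_SETS);

            if (!valid[set] || resident[set] != block) {
                misses++;                 /* fetch the block into its set */
                valid[set] = true;
                resident[set] = block;
            }
        }
    }
    return misses;
}
```

Under these assumptions the interchanged loop misses once per block, or 31,250 times, while the original order misses on all 500,000 accesses because each block is evicted before its next word is used.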


Blocking 


This optimization improves temporal locality to reduce misses. We are again 
dealing with multiple arrays, with some arrays accessed by rows and some by 
columns. Storing the arrays row by row (row major order) or column by column 
(column major order) does not solve the problem because both rows and columns 
are used in every loop iteration. Such orthogonal accesses mean that trans- 
formations such as loop interchange still leave plenty of room for improvement. 

Instead of operating on entire rows or columns of an array, blocked algorithms 
operate on submatrices or blocks. The goal is to maximize accesses to the data 
loaded into the cache before the data are replaced. The code example below, which 
performs matrix multiplication, helps motivate the optimization: 


    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            {r = 0;
             for (k = 0; k < N; k = k+1)
                 r = r + y[i][k]*z[k][j];
             x[i][j] = r;
            };


The two inner loops read all N-by-N elements of z, read the same N elements in a 
row of y repeatedly, and write one row of N elements of x. Figure 5.8 gives a 
snapshot of the accesses to the three arrays. A dark shade indicates a recent 
access, a light shade indicates an older access, and white means not yet accessed. 

The number of capacity misses clearly depends on N and the size of the cache. 
If it can hold all three N-by-N matrices, then all is well, provided there are no 
cache conflicts. If the cache can hold one N-by-N matrix and one row of N, then at 
least the ith row of y and the array z may stay in the cache. Less than that and
misses may occur for both x and z. In the worst case, there would be $2N^3 + N^2$
memory words accessed for $N^3$ operations.

To ensure that the elements being accessed can fit in the cache, the original 
code is changed to compute on a submatrix of size B by B. Two inner loops now 
compute in steps of size B rather than the full length of x and z. B is called the 
blocking factor. (Assume x is initialized to zero.) 
Figure 5.8 A snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of
accesses to the array elements is indicated by shade: white means not yet touched,
light means older accesses, and dark means newer accesses. Compared to Figure 5.9,
elements of y and z are read repeatedly to calculate new elements of x. The variables
i, j, and k are shown along the rows or columns used to access the arrays.


Figure 5.9 The age of accesses to the arrays x, y, and z when B = 3. Note in contrast
to Figure 5.8 the smaller number of elements accessed.


    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B,N); j = j+1)
                    {r = 0;
                     for (k = kk; k < min(kk+B,N); k = k+1)
                         r = r + y[i][k]*z[k][j];
                     x[i][j] = x[i][j] + r;
                    };

Figure 5.9 illustrates the accesses to the three arrays using blocking. Looking
only at capacity misses, the total number of memory words accessed is $2N^3/B + N^2$.
This total is an improvement by about a factor of B. Hence, blocking exploits a
combination of spatial and temporal locality, since y benefits from spatial locality
and z benefits from temporal locality.

Although we have aimed at reducing cache misses, blocking can also be used 
to help register allocation. By taking a small blocking size such that the block can 
be held in registers, we can minimize the number of loads and stores in the 
program. 
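
One way to convince yourself that blocking changes only the order of the work is to run both versions on the same input. The C sketch below wraps the two loop nests in functions; the function names, the use of double elements, and the fixed sizes N = 6 and B = 3 (taken from Figures 5.8 and 5.9) are assumptions for illustration.

```c
#define N 6   /* matrix dimension, as in Figure 5.8 */
#define B 3   /* blocking factor, as in Figure 5.9 */

int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked multiply: x = y * z. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked multiply. x must be zero on entry, because every (jj, kk) block
   adds a partial sum into x[i][j]. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

For inputs whose products and sums are exactly representable, such as small integers, the two functions produce identical results; for general floating-point data the different summation order can change the last bits.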


Tenth Optimization: Hardware Prefetching of Instructions and 
Data to Reduce Miss Penalty or Miss Rate 


Nonblocking caches effectively reduce the miss penalty by overlapping execu- 
tion with memory access. Another approach is to prefetch items before the pro- 
cessor requests them. Both instructions and data can be prefetched, either 
directly into the caches or into an external buffer that can be more quickly 
accessed than main memory. 

Instruction prefetch is frequently done in hardware outside of the cache. Typ- 
ically, the processor fetches two blocks on a miss: the requested block and the 
next consecutive block. The requested block is placed in the instruction cache 
when it returns, and the prefetched block is placed into the instruction stream 
buffer. If the requested block is present in the instruction stream buffer, the origi- 
nal cache request is canceled, the block is read from the stream buffer, and the 
next prefetch request is issued. 

A similar approach can be applied to data accesses [Jouppi 1990]. Palacharla 
and Kessler [1994] looked at a set of scientific programs and considered multiple 
stream buffers that could handle either instructions or data. They found that eight 
stream buffers could capture 50% to 70% of all misses from a processor with two 
64 KB four-way set-associative caches, one for instructions and the other for data. 

The Intel Pentium 4 can prefetch data into the second-level cache from up to 
eight streams from eight different 4 KB pages. Prefetching is invoked if there are 
two successive L2 cache misses to a page, and if the distance between those 
cache blocks is less than 256 bytes. (The stride limit is 512 bytes on some models 
of the Pentium 4.) It won't prefetch across a 4 KB page boundary. 
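
The trigger condition just described can be written as a small predicate. This C sketch encodes only the rule as stated in the text (two successive L2 misses in the same 4 KB page, less than 256 bytes apart); the function name is hypothetical, and the real hardware tracks more state than this.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES   4096u
#define MAX_DISTANCE 256u   /* 512 on some models, per the text */

/* True if two successive L2 miss addresses satisfy the stated rule. */
bool triggers_prefetch(uint64_t prev_miss, uint64_t this_miss)
{
    bool same_page = prev_miss / PAGE_BYTES == this_miss / PAGE_BYTES;
    uint64_t distance = this_miss > prev_miss ? this_miss - prev_miss
                                              : prev_miss - this_miss;

    return same_page && distance < MAX_DISTANCE;
}
```

For example, misses at 0x1000 and 0x1040 qualify, while misses at 0x0FC0 and 0x1000 do not because they fall in different pages.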

Figure 5.10 shows the overall performance improvement for a subset of 
SPEC2000 programs when hardware prefetching is turned on. Note that this fig- 
ure includes only 2 of 12 integer programs, while it includes the majority of the 
SPEC floating-point programs. 

Prefetching relies on utilizing memory bandwidth that otherwise would be 
unused, but if it interferes with demand misses, it can actually lower perfor- 
mance. Help from compilers can reduce useless prefetching. 


Eleventh Optimization: Compiler-Controlled Prefetching to 
Reduce Miss Penalty or Miss Rate 


Figure 5.10 Speedup due to hardware prefetching on Intel Pentium 4 with hardware
prefetching turned on for 2 of 12 SPECint2000 benchmarks and 9 of 14
SPECfp2000 benchmarks. Only the programs that benefit the most from prefetching
are shown; prefetching speeds up the missing 15 SPEC benchmarks by less than 15%
[Singhal 2004].


An alternative to hardware prefetching is for the compiler to insert prefetch
instructions to request data before the processor needs it. There are two flavors
of prefetch:


• Register prefetch will load the value into a register.


• Cache prefetch loads data only into the cache and not the register.


Either of these can be faulting or nonfaulting; that is, the address does or does 
not cause an exception for virtual address faults and protection violations. Using 
this terminology, a normal load instruction could be considered a "faulting regis- 
ter prefetch instruction." Nonfaulting prefetches simply turn into no-ops if they 
would normally result in an exception, which is what we want. 

The most effective prefetch is "semantically invisible" to a program: It 
doesn't change the contents of registers and memory, and it cannot cause virtual 
memory faults. Most processors today offer nonfaulting cache prefetches. This 
section assumes nonfaulting cache prefetch, also called nonbinding prefetch. 

Prefetching makes sense only if the processor can proceed while prefetching 
the data; that is, the caches do not stall but continue to supply instructions and 
data while waiting for the prefetched data to return. As you would expect, the 
data cache for such computers is normally nonblocking. 

Like hardware-controlled prefetching, the goal is to overlap execution with 
the prefetching of data. Loops are the important targets, as they lend themselves 
to prefetch optimizations. If the miss penalty is small, the compiler just unrolls 
the loop once or twice, and it schedules the prefetches with the execution. If the 
miss penalty is large, it uses software pipelining (see Appendix G) or unrolls 
many times to prefetch data for a future iteration. 


Issuing prefetch instructions incurs an instruction overhead, however, so
compilers must take care to ensure that such overheads do not exceed the
benefits. By concentrating on references that are likely to be cache misses,
programs can avoid unnecessary prefetches while improving average memory access
time significantly.


Example


For the code below, determine which accesses are likely to cause data cache
misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the
number of prefetch instructions executed and the misses avoided by prefetching.
Let's assume we have an 8 KB direct-mapped data cache with 16-byte blocks,
and it is a write-back cache that does write allocate. The elements of a and b are 8
bytes long since they are double-precision floating-point arrays. There are 3 rows
and 100 columns for a and 101 rows and 3 columns for b. Let's also assume they
are not in the cache at the start of the program.


    for (i = 0; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1)
            a[i][j] = b[j][0] * b[j+1][0];


Answer


The compiler will first determine which accesses are likely to cause cache
misses; otherwise, we will waste time on issuing prefetch instructions for data
that would be hits. Elements of a are written in the order that they are stored in
memory, so a will benefit from spatial locality: The even values of j will miss
and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will
lead to 3 × ⌈100/2⌉, or 150 misses.


The array b does not benefit from spatial locality since the accesses are not in
the order it is stored. The array b does benefit twice from temporal locality: The
same elements are accessed for each iteration of i, and each iteration of j uses
the same value of b as the last iteration. Ignoring potential conflict misses, the
misses due to b will be for b[j+1][0] accesses when i = 0, and also the first
access to b[j][0] when j = 0. Since j goes from 0 to 99 when i = 0, accesses to
b lead to 100 + 1, or 101 misses.

Thus, this loop will miss the data cache approximately 150 times for a plus 
101 times for b, or 251 misses. 
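
The hand count above can be checked with a small simulation. The sketch below
replays the loop's references through a direct-mapped, write-allocate cache with
the parameters of this example. The placement of the arrays (a at byte address 0
with b immediately after it) and all names (dm_cache, dm_access,
count_loop_misses) are assumptions of the sketch, chosen so that no conflict
misses occur.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    CACHE_BYTES = 8 * 1024,                 /* 8 KB data cache */
    BLOCK_BYTES = 16,                       /* 16-byte blocks */
    NUM_BLOCKS  = CACHE_BYTES / BLOCK_BYTES,
    ELEM_BYTES  = 8,                        /* double-precision elements */
    A_ROWS = 3,   A_COLS = 100,
    B_ROWS = 101, B_COLS = 3
};

/* Assumed layout: a starts at byte 0 and b follows it immediately. */
#define A_BASE ((uint32_t)0)
#define B_BASE ((uint32_t)(A_ROWS * A_COLS * ELEM_BYTES))

struct dm_cache {
    bool     valid[NUM_BLOCKS];
    uint32_t tag[NUM_BLOCKS];
    unsigned misses;
};

/* Touch one byte address. Reads and writes both allocate on a miss. */
static void dm_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t index = block % NUM_BLOCKS;
    uint32_t tag   = block / NUM_BLOCKS;

    if (!c->valid[index] || c->tag[index] != tag) {
        c->valid[index] = true;
        c->tag[index]   = tag;
        c->misses++;
    }
}

static uint32_t addr_a(int i, int j)
{
    return A_BASE + (uint32_t)(i * A_COLS + j) * ELEM_BYTES;
}

static uint32_t addr_b(int i, int j)
{
    return B_BASE + (uint32_t)(i * B_COLS + j) * ELEM_BYTES;
}

/* Replays the references of the original loop nest and counts the misses
   caused by each array separately. */
void count_loop_misses(unsigned *a_misses, unsigned *b_misses)
{
    struct dm_cache cache = {0};

    *a_misses = 0;
    *b_misses = 0;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 100; j++) {
            unsigned before = cache.misses;
            dm_access(&cache, addr_b(j, 0));        /* read b[j][0]   */
            dm_access(&cache, addr_b(j + 1, 0));    /* read b[j+1][0] */
            *b_misses += cache.misses - before;

            before = cache.misses;
            dm_access(&cache, addr_a(i, j));        /* write a[i][j]  */
            *a_misses += cache.misses - before;
        }
    }
}
```

With this layout the function reports 150 misses for a and 101 for b, the same
251 found by hand. A different placement of a and b could add conflict misses,
which the analysis above deliberately ignores.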

To simplify our optimization, we will not worry about prefetching the first
accesses of the loop. These may already be in the cache, or we will pay the miss
penalty of the first few elements of a or b. Nor will we worry about suppressing the
prefetches at the end of the loop that try to prefetch beyond the end of a
(a[i][100] ... a[i][106]) and the end of b (b[101][0] ... b[107][0]). If these
were faulting prefetches, we could not take this luxury. Let's assume that the miss
penalty is so large we need to start prefetching at least, say, seven iterations in
advance. (Stated alternatively, we assume prefetching has no benefit until the eighth
iteration.) The revised code adds prefetch calls to the loop nest above.


    for (j = 0; j < 100; j = j+1) {
        prefetch(b[j+7][0]);
        /* b(j,0) for 7 iterations later */
        prefetch(a[0][j+7]);
        /* a(0,j) for 7 iterations later */
        a[0][j] = b[j][0] * b[j+1][0];}
    for (i = 1; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1) {
            prefetch(a[i][j+7]);
            /* a(i,j) for +7 iterations */
            a[i][j] = b[j][0] * b[j+1][0];}


This revised code prefetches a[i][7] through a[i][99] and b[7][0] through
b[100][0], reducing the number of nonprefetched misses to


• 7 misses for elements b[0][0], b[1][0], ..., b[6][0] in the first loop

• 4 misses (⌈7/2⌉) for elements a[0][0], a[0][1], ..., a[0][6] in the first
  loop (spatial locality reduces misses to 1 per 16-byte cache block)

• 4 misses (⌈7/2⌉) for elements a[1][0], a[1][1], ..., a[1][6] in the second
  loop

• 4 misses (⌈7/2⌉) for elements a[2][0], a[2][1], ..., a[2][6] in the second
  loop


or a total of 19 nonprefetched misses. The cost of avoiding 232 cache misses is 
executing 400 prefetch instructions, likely a good trade-off. 
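
For readers who want to try the transformation, here is one way the revised
loops might be written in C. It is a sketch under stated assumptions: it uses
the __builtin_prefetch intrinsic provided by GCC and Clang in place of the
generic prefetch() call, stores a as a flat array, and pads both arrays so that
the addresses prefetched past the end of the data still lie inside the arrays.
The padding and the names PREFETCH, AHEAD, and run_with_prefetch are ours.

```c
enum { ROWS = 3, COLS = 100, AHEAD = 7 };

/* Storage is padded (an assumption of this sketch) so that every prefetch
   address stays inside its array object, keeping the address arithmetic
   well defined even though the prefetch itself would not fault. */
static double a[ROWS * COLS + AHEAD];     /* a[i][j] is a[i * COLS + j] */
static double b[COLS + 1 + AHEAD][3];

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))           /* no-op on other compilers */
#endif

/* Runs the prefetching version of the loop nest and returns the number of
   prefetch instructions it issued. */
unsigned run_with_prefetch(void)
{
    unsigned prefetches = 0;

    for (int j = 0; j < COLS; j++) {
        PREFETCH(&b[j + AHEAD][0]);        /* b(j,0) for 7 iterations later */
        PREFETCH(&a[j + AHEAD]);           /* a(0,j) for 7 iterations later */
        prefetches += 2;
        a[j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            PREFETCH(&a[i * COLS + j + AHEAD]);   /* a(i,j) 7 iterations later */
            prefetches += 1;
            a[i * COLS + j] = b[j][0] * b[j + 1][0];
        }
    }
    return prefetches;
}
```

The function returns 400, the prefetch count derived above: 2 x 100 in the first
loop plus 200 in the second.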


Example


Calculate the time saved in the example above. Ignore instruction cache misses
and assume there are no conflict or capacity misses in the data cache. Assume
that prefetches can overlap with each other and with cache misses, thereby
transferring at the maximum memory bandwidth. Here are the key loop times ignoring
cache misses: The original loop takes 7 clock cycles per iteration, the first
prefetch loop takes 9 clock cycles per iteration, and the second prefetch loop
takes 8 clock cycles per iteration (including the overhead of the outer for loop). A
miss takes 100 clock cycles.


Answer


The original doubly nested loop executes the multiply 3 x 100 or 300 times.
Since the loop takes 7 clock cycles per iteration, the total is 300 x 7 or 2100 clock 
cycles plus cache misses. Cache misses add 251 x 100 or 25,100 clock cycles, 
giving a total of 27,200 clock cycles. The first prefetch loop iterates 100 times; at 
9 clock cycles per iteration the total is 900 clock cycles plus cache misses. They 
add 11 x 100 or 1100 clock cycles for cache misses, giving a total of 2000. The 
second loop executes 2 X 100 or 200 times, and at 8 clock cycles per iteration it 
takes 1600 clock cycles plus 8 x 100 or 800 clock cycles for cache misses. This 
gives a total of 2400 clock cycles. From the prior example, we know that this 
code executes 400 prefetch instructions during the 2000 + 2400 or 4400 clock
cycles to execute these two loops. If we assume that the prefetches are completely
overlapped with the rest of the execution, then the prefetch code is
27,200/4400 or 6.2 times faster. 
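
The arithmetic of this answer can be collected in one place. The block below
only restates the numbers already given and adds no new assumptions:

```latex
\begin{align*}
T_{\text{original}} &= 300 \times 7 + 251 \times 100 = 2100 + 25{,}100 = 27{,}200 \text{ cycles} \\
T_{\text{loop 1}}   &= 100 \times 9 + 11 \times 100 = 900 + 1100 = 2000 \text{ cycles} \\
T_{\text{loop 2}}   &= 200 \times 8 + 8 \times 100 = 1600 + 800 = 2400 \text{ cycles} \\
\text{Speedup}      &= \frac{27{,}200}{2000 + 2400} = \frac{27{,}200}{4400} \approx 6.2
\end{align*}
```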


Although array optimizations are easy to understand, modern programs are 
more likely to use pointers. Luk and Mowry [1999] have demonstrated that 
compiler-based prefetching can sometimes be extended to pointers as well. Of 10 
programs with recursive data structures, prefetching all pointers when a node is 
visited improved performance by 4% to 31% in half the programs. On the other 
hand, the remaining programs were still within 2% of their original performance. 
The issue is both whether prefetches are to data already in the cache and whether 
they occur early enough for the data to arrive by the time it is needed. 


Cache Optimization Summary 


The techniques to improve hit time, bandwidth, miss penalty, and miss rate gen- 
erally affect the other components of the average memory access equation as well 
as the complexity of the memory hierarchy. Figure 5.11 summarizes these tech- 
niques and estimates the impact on complexity, with + meaning that the tech- 
nique improves the factor, - meaning it hurts that factor, and blank meaning it has 
no impact. Generally, no technique helps more than one category.


                                 Hit    Band-   Miss      Miss   Hardware cost/
Technique                        time   width   penalty   rate   complexity       Comment

Small and simple caches          +                        -      0                Trivial; widely used
Way-predicting caches            +                               1                Used in Pentium 4
Trace caches                     +                               3                Used in Pentium 4
Pipelined cache access           -      +                        1                Widely used
Nonblocking caches                      +       +                3                Widely used
Banked caches                           +                        1                Used in L2 of Opteron and Niagara
Critical word first and                         +                2                Widely used
  early restart
Merging write buffer                            +                1                Widely used with write through
Compiler techniques to reduce                             +      0                Software is a challenge; some
  cache misses                                                                    computers have compiler option
Hardware prefetching of                         +         +      2 instr.,        Many prefetch instructions;
  instructions and data                                          3 data           Opteron and Pentium 4 prefetch data
Compiler-controlled                             +         +      3                Needs nonblocking cache; possible
  prefetching                                                                     instruction overhead; in many CPUs


Figure 5.11 Summary of 11 advanced cache optimizations showing impact on cache
performance and complexity. Although generally a technique helps only one factor,
prefetching can reduce misses if done sufficiently early; if not, it can reduce
miss penalty. + means that the technique improves the factor, - means it hurts
that factor, and blank means it has no impact. The complexity measure is
subjective, with 0 being the easiest and 3 being a challenge.


5.3    Memory Technology and Optimizations


... the one single development that put computers on their feet was the invention
of a reliable form of memory, namely, the core memory. ... Its cost was reasonable,
it was reliable and, because it was reliable, it could in due course be made large.
[p. 209]
Maurice Wilkes
Memoirs of a Computer Pioneer (1985)


Main memory is the next level down in the hierarchy. Main memory satisfies the 
demands of caches and serves as the I/O interface, as it is the destination of input 
as well as the source for output. Performance measures of main memory empha- 
size both latency and bandwidth. Traditionally, main memory latency (which 
affects the cache miss penalty) is the primary concern of the cache, while main 
memory bandwidth is the primary concern of multiprocessors and I/O. Chapter 4 
discusses the relationship of main memory and multiprocessors, and Chapter 6 
discusses the relationship of main memory and I/O. 

Although caches benefit from low-latency memory, it is generally easier to 
improve memory bandwidth with new organizations than it is to reduce latency. 
The popularity of second-level caches, and their larger block sizes, makes main 
memory bandwidth important to caches as well. In fact, cache designers increase 
block size to take advantage of the high memory bandwidth. 

The previous sections describe what can be done with cache organization to 
reduce this processor-DRAM performance gap, but simply making caches larger 
or adding more levels of caches cannot eliminate the gap. Innovations in main 
memory are needed as well. 

In the past, the innovation was how to organize the many DRAM chips that 
made up the main memory, such as multiple memory banks. Higher bandwidth is 
available using memory banks, by making memory and its bus wider, or doing 
both. 

Ironically, as capacity per memory chip increases, there are fewer chips in the 
same-sized memory system, reducing chances for innovation. For example, a 
2 GB main memory takes 256 memory chips of 64 Mbit (16M x 4 bits), easily 
organized into 16 64-bit-wide banks of 16 memory chips. However, it takes only 
16 256M x 4-bit memory chips for 2 GB, making one 64-bit-wide bank the limit. 
Since computers are often sold and benchmarked with small, standard memory 
configurations, manufacturers cannot rely on very large memories to get band- 
width. This shrinking number of chips in a standard configuration shrinks the 
importance of innovations at the board level. 
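
The chip counts in this example follow from simple division, which the sketch
below makes explicit. The structure and function names (mem_org, organize) are
illustrative and not from any library, and the function assumes the capacity
divides evenly into chips and banks.

```c
#include <stdint.h>

struct mem_org {
    uint64_t chips;   /* DRAM chips needed for the requested capacity */
    uint64_t banks;   /* independent banks of the requested width */
};

/* capacity_bytes:   total main memory capacity
   chip_words:       number of addressable words per chip (e.g., 16M)
   chip_width_bits:  bits delivered per word by one chip (e.g., 4)
   bank_width_bits:  data width of one bank (e.g., 64) */
struct mem_org organize(uint64_t capacity_bytes, uint64_t chip_words,
                        uint64_t chip_width_bits, uint64_t bank_width_bits)
{
    struct mem_org m;
    uint64_t chip_bits      = chip_words * chip_width_bits;
    uint64_t chips_per_bank = bank_width_bits / chip_width_bits;

    m.chips = capacity_bytes * 8 / chip_bits;
    m.banks = m.chips / chips_per_bank;
    return m;
}
```

Calling organize with 2 GB, 16M x 4-bit chips, and 64-bit banks gives 256 chips
in 16 banks; with 256M x 4-bit chips it gives 16 chips in a single bank, as
stated above.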

Hence, memory innovations are now happening inside the DRAM chips 
themselves. This section describes the technology inside the memory chips and 
those innovative, internal organizations. Before describing the technologies and 
options, let's go over the performance metrics. 

Memory latency is traditionally quoted using two measures—access time and 
cycle time. Access time is the time between when a read is requested and when 
the desired word arrives, while cycle time is the minimum time between requests
to memory. One reason that cycle time is greater than access time is that the
memory needs the address lines to be stable between accesses. 

Virtually all desktop or server computers since 1975 used DRAMs for main 
memory, and virtually all use SRAMs for cache, our first topic. 


SRAM Technology 


The first letter of SRAM stands for static. The dynamic nature of the circuits in 
DRAM requires data to be written back after being read—hence the difference 
between the access time and the cycle time as well as the need to refresh. SRAMs 
don't need to refresh and so the access time is very close to the cycle time. 
SRAMs typically use six transistors per bit to prevent the information from being
disturbed when read. SRAM needs only minimal power to retain the charge in 
standby mode. 

SRAM designs are concerned with speed and capacity, while in DRAM 
designs the emphasis is on cost per bit and capacity. For memories designed in 
comparable technologies, the capacity of DRAMs is roughly 4-8 times that of 
SRAMs. The cycle time of SRAMs is 8-16 times faster than DRAMs, but they 
are also 8-16 times as expensive. 


DRAM Technology 


As early DRAMs grew in capacity, the cost of a package with all the necessary 
address lines was an issue. The solution was to multiplex the address lines, 
thereby cutting the number of address pins in half. Figure 5.12 shows the basic 
DRAM organization. One-half of the address is sent first, called the row access 
strobe (RAS). The other half of the address, sent during the column access strobe 
(CAS), follows it. These names come from the internal chip organization, since 
the memory is organized as a rectangular matrix addressed by rows and columns.

Figure 5.12 Internal organization of a 64M bit DRAM. DRAMs often use banks of
memory arrays internally and select between them. For example, instead of one 16,384
x 16,384 memory, a DRAM might use 256 1024 x 1024 arrays or 16 2048 x 2048 arrays.

An additional requirement of DRAM derives from the property signified by
its first letter, D, for dynamic. To pack more bits per chip, DRAMs use only a
single transistor to store a bit. Reading that bit destroys the information, so it must
gle transistor to store a bit. Reading that bit destroys the information, so it must 
be restored. This is one reason the DRAM cycle time is much longer than the 
access time. In addition, to prevent loss of information when a bit is not read or 
written, the bit must be "refreshed" periodically. Fortunately, all the bits in a row 
can be refreshed simultaneously just by reading that row. Hence, every DRAM in 
the memory system must access every row within a certain time window, such as 
8 ms. Memory controllers include hardware to refresh the DRAMs periodically. 

This requirement means that the memory system is occasionally unavailable 
because it is sending a signal telling every chip to refresh. The time for a refresh 
is typically a full memory access (RAS and CAS) for each row of the DRAM. 
Since the memory matrix in a DRAM is conceptually square, the number of steps 
in a refresh is usually the square root of the DRAM capacity. DRAM designers 
try to keep time spent refreshing to less than 5% of the total time. 
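
As a rough illustration of this refresh budget, take the 64K bit DRAM of 1980
from Figure 5.13 and assume, purely for the sake of the sketch, a square array,
one full 250 ns cycle per refreshed row, and the 8 ms window mentioned above:

```latex
\begin{align*}
\text{Rows} &= \sqrt{64\text{K}} = \sqrt{2^{16}} = 256 \\
\text{Refresh time per window} &= 256 \times 250\ \text{ns} = 64\ \mu\text{s} \\
\text{Fraction of time refreshing} &= \frac{64\ \mu\text{s}}{8\ \text{ms}} = 0.8\%
\end{align*}
```

Under these assumptions the chip spends well under the 5% target on refresh.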

So far we have presented main memory as if it operated like a Swiss train, 
consistently delivering the goods exactly according to schedule. Refresh belies 
that analogy, since some accesses take much longer than others do. Thus, refresh 
is another reason for variability of memory latency and hence cache miss penalty. 

Amdahl suggested a rule of thumb that memory capacity should grow lin- 
early with processor speed to keep a balanced system, so that a 1000 MIPS pro- 
cessor should have 1000 MB of memory. Processor designers rely on DRAMs to 
supply that demand: In the past, they expected a fourfold improvement in capac- 
ity every three years, or 55% per year. Unfortunately, the performance of 
DRAMs is growing at a much slower rate. Figure 5.13 shows a performance
improvement in row access time, which is related to latency, of about 5% per 
year. The CAS or data transfer time, which is related to bandwidth, is growing at 
more than twice that rate. 

Although we have been talking about individual chips, DRAMs are com- 
monly sold on small boards called dual inline memory modules (DIMMs). 
DIMMs typically contain 4-16 DRAMs, and they are normally organized to be 8 
bytes wide (+ ECC) for desktop systems. 

In addition to the DIMM packaging and the new interfaces to improve the 
data transfer time, discussed in the following subsections, the biggest change to 
DRAMs has been a slowing down in capacity growth. DRAMs obeyed Moore's 
Law for 20 years, bringing out a new chip with four times the capacity every 
three years. Due to a slowing in demand for DRAMs, since 1998 new chips only 
double capacity every two years. In 2006, this new slower pace shows signs of 
further deceleration. 


Improving Memory Performance inside a DRAM Chip 


As Moore's Law continues to supply more transistors and as the processor-memory
gap increases pressure on memory performance, the ideas of the previous section
have made their way inside the DRAM chip. Generally, innovation has led to
greater bandwidth, sometimes at the cost of greater latency. This subsection
presents techniques that take advantage of the nature of DRAMs.


                              Row access strobe (RAS)      Column access
                                                           strobe (CAS)/
Year of                       Slowest        Fastest       data transfer    Cycle
introduction    Chip size     DRAM (ns)      DRAM (ns)     time (ns)        time (ns)

1980            64K bit       180            150           75               250
1983            256K bit      150            120           50               220
1986            1M bit        120            100           25               190
1989            4M bit        100            80            20               165
1992            16M bit       80             60            15               120
1996            64M bit       70             50            12               110
1998            128M bit      70             50            10               100
2000            256M bit      65             45            7                90
2002            512M bit      60             40            5                80
2004            1G bit        55             35            5                70
2006            2G bit        50             30            2.5              60


Figure 5.13 Times of fast and slow DRAMs with each generation. (Cycle time is
defined on page 310.) Performance improvement of row access time is about 5% per
year. The improvement by a factor of 2 in column access in 1986 accompanied the
switch from NMOS DRAMs to CMOS DRAMs.

As mentioned earlier, a DRAM access is divided into row access and column
access. DRAMs must buffer a row of bits inside the DRAM for the column
access, and this row is usually the square root of the DRAM size: 16K bits for
256M bits, 32K bits for 1G bits, and so on.

Although presented logically as a single monolithic array of memory bits, the 
internal organization of DRAM actually consists of many memory modules. For 
a variety of manufacturing reasons, these modules are usually 1-4M bits. Thus, if 
you were to examine a 1G bit DRAM under a microscope, you might see 512 
memory arrays, each of 2M bits, on the chip. This large number of arrays inter- 
nally presents the opportunity to provide much higher bandwidth off chip. 

To improve bandwidth, there has been a variety of evolutionary innovations 
over time. The first was timing signals that allow repeated accesses to the row 
buffer without another row access time, typically called fast page mode. Such a 
buffer comes naturally, as each array will buffer 1024-2048 bits for each access. 

Conventional DRAMs had an asynchronous interface to the memory controller,
and hence every transfer involved overhead to synchronize with the controller.
The second major change was to add a clock signal to the DRAM interface,
so that the repeated transfers would not bear that overhead. Synchronous DRAM
(SDRAM) is the name of this optimization. SDRAMs typically also had a
programmable register to hold the number of bytes requested, and hence can send
many bytes over several cycles per request.


            Clock rate    M transfers    DRAM          MB/sec     DIMM
Standard    (MHz)         per second     name          /DIMM      name

DDR         133           266            DDR266        2128       PC2100
DDR         150           300            DDR300        2400       PC2400
DDR         200           400            DDR400        3200       PC3200
DDR2        266           533            DDR2-533      4264       PC4300
DDR2        333           667            DDR2-667      5336       PC5300
DDR2        400           800            DDR2-800      6400       PC6400
DDR3        533           1066           DDR3-1066     8528       PC8500
DDR3        666           1333           DDR3-1333     10,664     PC10700
DDR3        800           1600           DDR3-1600     12,800     PC12800


Figure 5.14 Clock rates, bandwidth, and names of DDR DRAMs and DIMMs in 2006.
Note the numerical relationship between the columns. The third column is twice the
second, and the fourth uses the number from the third column in the name of the
DRAM chip. The fifth column is eight times the third column, and a rounded version of
this number is used in the name of the DIMM. Although not shown in this figure, DDRs
also specify latency in clock cycles. The name DDR400 CL3 means that memory delays 3
clock cycles of 5 ns each (the clock period of a 200 MHz clock) before starting to
deliver the requested data. The exercises explore these details further.

The third major DRAM innovation to increase bandwidth is to transfer data 
on both the rising edge and falling edge of the DRAM clock signal, thereby dou- 
bling the peak data rate. This optimization is called double data rate (DDR). To 
supply data at these high rates, DDR SDRAMs activate multiple banks internally. 

The bus speeds for these DRAMs are also 133-200 MHz, but these DDR 
DIMMs are confusingly labeled by the peak DIMM bandwidth. Hence, the 
DIMM name PC2100 comes from 133 MHz x 2 x 8 bytes or 2100 MB/sec. Sus- 
taining the confusion, the chips themselves are labeled with the number of bits 
per second rather than their clock rate, so a 133 MHz DDR chip is called a 
DDR266. Figure 5.14 shows the relationship between clock rate, transfers per 
second per chip, chip name, DIMM bandwidth, and DIMM name. 
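
The naming arithmetic can be written out as a short sketch. The structure and
function names (ddr_rates, ddr_rates_from_clock) are ours, and the code assumes
an 8-byte-wide DIMM as described above. Some rows of Figure 5.14 quote rounded
clock rates, so for those rows the computed values can differ from the
marketing names by one.

```c
struct ddr_rates {
    unsigned mtransfers_per_sec;   /* M transfers per second per chip */
    unsigned dimm_mb_per_sec;      /* peak MB/sec of an 8-byte-wide DIMM */
};

/* DDR moves data on both clock edges, so there are two transfers per
   clock cycle; each DIMM transfer carries 8 bytes. */
struct ddr_rates ddr_rates_from_clock(unsigned clock_mhz)
{
    struct ddr_rates r;

    r.mtransfers_per_sec = 2 * clock_mhz;
    r.dimm_mb_per_sec    = 8 * r.mtransfers_per_sec;
    return r;
}
```

For a 133 MHz clock this gives 266 M transfers per second and 2128 MB/sec, the
figures behind the names DDR266 and PC2100.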


Example


Suppose you measured a new DDR3 DIMM to transfer at 16000 MB/sec. What
do you think its name will be? What is the clock rate of that DIMM? What is your
guess of the name of DRAMs used in that DIMM?


Answer


A good guideline is to assume that DRAM marketers picked names with the biggest
numbers. The DIMM name is likely PC16000. The clock rate of the DIMM is

    Clock rate x 2 x 8 = 16000
    Clock rate = 16000/16
    Clock rate = 1000

or 1000 MHz and 2000 M transfers per second, so the DRAM name is likely to
be DDR3-2000.


DDR is now a sequence of standards. DDR2 lowers power by dropping the 
voltage from 2.5 volts to 1.8 volts and offers higher clock rates: 266 MHz, 333 
MHz, and 400 MHz. DDR3 drops voltage to 1.5 volts and has a maximum clock 
speed of 800 MHz. 

In each of these three cases, the advantage of such optimizations is that they 
add a small amount of logic to exploit the high potential internal DRAM band- 
width, adding little cost to the system while achieving a significant improvement 
in bandwidth. 


Protection: Virtual Memory and Virtual Machines 


A virtual machine is taken to be an efficient, isolated duplicate of the real 
machine. We explain these notions through the idea of a virtual machine monitor 
(VMM). ... a VMM has three essential characteristics. First, the VMM provides an
environment for programs which is essentially identical with the original machine; 
second, programs run in this environment show at worst only minor decreases in 
speed; and last, the VMM is in complete control of system resources. 


Gerald Popek and Robert Goldberg 
"Formal requirements for virtualizable third generation architectures," 
Communications of the ACM (July 1974) 


Security and privacy are two of the most vexing challenges for information tech- 
nology in 2006. Electronic burglaries, often involving lists of credit card num- 
bers, are announced regularly, and it's widely believed that many more go 
unreported. Hence, both researchers and practitioners are looking for new ways 
to make computing systems more secure. Although protecting information is not 
limited to hardware, in our view real security and privacy will likely involve 
innovation in computer architecture as well as in systems software. 

This section starts with a review of the architecture support for protecting 
processes from each other via virtual memory. It then describes the added protec- 
tion provided from virtual machines, the architecture requirements of virtual 
machines, and the performance of a virtual machine. 


Protection via Virtual Memory 


Page-based virtual memory, including a translation lookaside buffer that caches
page table entries, is the primary mechanism that protects processes from each
other. Sections C.4 and C.5 in Appendix C review virtual memory, including a
detailed description of protection via segmentation and paging in the 80x86. This 
subsection acts as a quick review; refer to those sections if it's too quick. 

Multiprogramming, where several programs running concurrently would 
share a computer, led to demands for protection and sharing among programs and 
to the concept of a process. Metaphorically, a process is a program's breathing air 
and living space—that is, a running program plus any state needed to continue 
running it. At any instant, it must be possible to switch from one process to 
another. This exchange is called a process switch or context switch. 

The operating system and architecture join forces to allow processes to share 
the hardware yet not interfere with each other. To do this, the architecture must 
limit what a process can access when running a user process yet allow an operat- 
ing system process to access more. At the minimum, the architecture must do the 
following: 


1. Provide at least two modes, indicating whether the running process is a user 
process or an operating system process. This latter process is sometimes 
called a kernel process or a supervisor process. 


2. Provide a portion of the processor state that a user process can use but not 
write. This state includes a user/supervisor mode bit(s), an exception enable/
disable bit, and memory protection information. Users are prevented from 
writing this state because the operating system cannot control user processes 
if users can give themselves supervisor privileges, disable exceptions, or 
change memory protection. 


3. Provide mechanisms whereby the processor can go from user mode to super- 
visor mode and vice versa. The first direction is typically accomplished by a 
system call, implemented as a special instruction that transfers control to a 
dedicated location in supervisor code space. The PC is saved from the point 
of the system call, and the processor is placed in supervisor mode. The return 
to user mode is like a subroutine return that restores the previous user/super- 
visor mode. 


4. Provide mechanisms to limit memory accesses to protect the memory state of 
a process without having to swap the process to disk on a context switch. 


Appendix C describes several memory protection schemes, but by far the 
most popular is adding protection restrictions to each page of virtual memory. 
Fixed-sized pages, typically 4 KB or 8 KB long, are mapped from the virtual 
address space into physical address space via a page table. The protection restric- 
tions are included in each page table entry. The protection restrictions might 
determine whether a user process can read this page, whether a user process can 
write to this page, and whether code can be executed from this page. In addition, 
a process can neither read nor write a page if it is not in the page table. Since only 
the OS can update the page table, the paging mechanism provides total access 
protection. 

Paged virtual memory means that every memory access logically takes at
least twice as long, with one memory access to obtain the physical address and a
second access to get the data. This cost would be far too dear. The solution is to
rely on the principle of locality; if the accesses have locality, then the address
translations for the accesses must also have locality. By keeping these address
translations in a special cache, a memory access rarely requires a second access
to translate the address. This special address translation cache is referred to as a
translation lookaside buffer (TLB). 

A TLB entry is like a cache entry where the tag holds portions of the virtual 
address and the data portion holds a physical page address, protection field, valid 
bit, and usually a use bit and a dirty bit. The operating system changes these bits 
by changing the value in the page table and then invalidating the corresponding 
TLB entry. When the entry is reloaded from the page table, the TLB gets an accu- 
rate copy of the bits. 
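
To make the structure of a TLB entry concrete, the sketch below models a tiny
fully associative TLB in C. Everything in it is hypothetical and chosen for
illustration: the 4 KB page size, the eight entries, the permission encoding,
and the names tlb_entry, tlb_translate, and tlb_invalidate. A real TLB performs
the comparison in hardware and in parallel across all entries.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12, TLB_ENTRIES = 8 };              /* 4 KB pages */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };  /* assumed encoding */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;     /* tag: virtual page number */
    uint64_t ppn;     /* data: physical page number */
    unsigned perms;   /* protection field copied from the page table entry */
    bool     use;     /* set on any access */
    bool     dirty;   /* set on a write */
};

enum tlb_result { TLB_HIT, TLB_MISS, TLB_PROTECTION_FAULT };

/* Looks up vaddr. On a hit with sufficient permission, stores the physical
   address in *paddr and updates the use and dirty bits. */
enum tlb_result tlb_translate(struct tlb_entry tlb[], uint64_t vaddr,
                              unsigned need, uint64_t *paddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            if ((tlb[i].perms & need) != need)
                return TLB_PROTECTION_FAULT;
            tlb[i].use = true;
            if (need & PERM_WRITE)
                tlb[i].dirty = true;
            *paddr = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return TLB_HIT;
        }
    }
    return TLB_MISS;   /* the page table must now be consulted */
}

/* After the OS changes a page table entry, it drops the stale cached copy
   so that the next access reloads the entry from the page table. */
void tlb_invalidate(struct tlb_entry tlb[], uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            tlb[i].valid = false;
}
```

The invalidate routine mirrors the sequence described above: the operating
system edits the page table first, then removes the matching TLB entry, and the
reload on the next miss picks up the new bits.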

Assuming the computer faithfully obeys the restrictions on pages and maps 
virtual addresses to physical addresses, it would seem that we are done. Newspa- 
per headlines suggest otherwise. 

The reason we're not done is that we depend on the accuracy of the operating 
system as well as the hardware. Today's operating systems consist of tens of mil- 
lions of lines of code. Since bugs are measured in number per thousand lines of 
code, there are thousands of bugs in production operating systems. Flaws in the 
OS have led to vulnerabilities that are routinely exploited. 

This problem, and the possibility that not enforcing protection could be much 
more costly than in the past, has led some to look for a protection model with a 
much smaller code base than the full OS, such as Virtual Machines. 


Protection via Virtual Machines 


An idea related to virtual memory that is almost as old is Virtual Machines (VM). 
They were first developed in the late 1960s, and they have remained an important 
part of mainframe computing over the years. Although largely ignored in the 
domain of single-user computers in the 1980s and 1990s, they have recently gained 
popularity due to 


• the increasing importance of isolation and security in modern systems,

• the failures in security and reliability of standard operating systems,

• the sharing of a single computer among many unrelated users, and

• the dramatic increases in raw speed of processors, which makes the overhead
  of VMs more acceptable.


The broadest definition of VMs includes basically all emulation methods that 
provide a standard software interface, such as the Java VM. We are interested in 
VMs that provide a complete system-level environment at the binary instruction 
set architecture (ISA) level. Although some VMs run different ISAs in the VM 
from the native hardware, we assume they always match the hardware. Such VMs 
are called (Operating) System Virtual Machines. IBM VM/370, VMware ESX 
Server, and Xen are examples. They present the illusion that the users of a VM
have an entire computer to themselves, including a copy of the operating system.
A single computer runs multiple VMs and can support a number of different 
operating systems (OSes). On a conventional platform, a single OS "owns" all 
the hardware resources, but with a VM, multiple OSes all share the hardware 
resources. 

The software that supports VMs is called a virtual machine monitor (VMM) 
or hypervisor; the VMM is the heart of Virtual Machine technology. The underly- 
ing hardware platform is called the host, and its resources are shared among the 
guest VMs. The VMM determines how to map virtual resources to physical 
resources: A physical resource may be time-shared, partitioned, or even emulated 
in software. The VMM is much smaller than a traditional OS; the isolation por- 
tion of a VMM is perhaps only 10,000 lines of code. 

In general, the cost of processor virtualization depends on the workload. 
User-level processor-bound programs, such as SPEC CPU2006, have zero virtu- 
alization overhead because the OS is rarely invoked so everything runs at native 
speeds. I/O-intensive workloads are generally also OS-intensive: they execute 
many system calls and privileged instructions, which can result in high 
virtualization overhead. The overhead is determined by the number of instructions that 
must be emulated by the VMM and how slowly they are emulated. Hence, when 
the guest VMs run the same ISA as the host, as we assume here, the goal of the 
architecture and the VMM is to run almost all instructions directly on the native 
hardware. On the other hand, if the I/O-intensive workload is also I/O-bound, 
the cost of processor virtualization can be completely hidden by low processor 
utilization since it is often waiting for I/O (as we will see later in Figures 5.15 
and 5.16). 

Although our interest here is in VMs for improving protection, VMs provide 
two other benefits that are commercially significant: 


1. Managing software. VMs provide an abstraction that can run the complete 
software stack, even including old operating systems like DOS. A typical 
deployment might be some VMs running legacy OSes, many running the cur- 
rent stable OS release, and a few testing the next OS release. 


2. Managing hardware. One reason for multiple servers is to have each applica- 
tion running with the compatible version of the operating system on separate 
computers, as this separation can improve dependability. VMs allow these 
separate software stacks to run independently yet share hardware, thereby 
consolidating the number of servers. Another example is that some VMMs 
support migration of a running VM to a different computer, either to balance 
load or to evacuate from failing hardware. 


Requirements of a Virtual Machine Monitor 


What must a VM monitor do? It presents a software interface to guest software, it 
must isolate the state of guests from each other, and it must protect itself from guest 
software (including guest OSes). The qualitative requirements are 

• Guest software should behave on a VM exactly as if it were running on the 
native hardware, except for performance-related behavior or limitations of 
fixed resources shared by multiple VMs. 

• Guest software should not be able to change allocation of real system 
resources directly. 


To "virtualize" the processor, the VMM must control just about everything— 
access to privileged state, address translation, I/O, exceptions and interrupts— 
even though the guest VM and OS currently running are temporarily using them. 

For example, in the case of a timer interrupt, the VMM would suspend the 
currently running guest VM, save its state, handle the interrupt, determine which 
guest VM to run next, and then load its state. Guest VMs that rely on a timer 
interrupt are provided with a virtual timer and an emulated timer interrupt by the 
VMM. 
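
The timer-interrupt sequence just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code of any real VMM: the names GuestVM, ToyVMM, and on_timer_interrupt are hypothetical, a dictionary stands in for processor state, and the next guest is picked round-robin purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GuestVM:
    name: str
    saved_state: dict = field(default_factory=dict)  # stand-in for registers, PC, ...
    virtual_timer_pending: bool = False               # emulated timer interrupt


class ToyVMM:
    """Hypothetical round-robin VMM; only the timer path is modeled."""

    def __init__(self, guests):
        self.guests = list(guests)
        self.current = 0  # index of the guest that is running now

    def on_timer_interrupt(self, cpu_state):
        running = self.guests[self.current]
        running.saved_state = dict(cpu_state)   # suspend the guest and save its state
        running.virtual_timer_pending = True    # handle: post an emulated timer interrupt
        self.current = (self.current + 1) % len(self.guests)  # choose the next guest
        return dict(self.guests[self.current].saved_state)    # state to load


vmm = ToyVMM([GuestVM("A"), GuestVM("B", saved_state={"pc": 100})])
loaded = vmm.on_timer_interrupt({"pc": 7})
print(loaded)
```

The interrupted guest never sees the real timer; it only finds its virtual timer flag set the next time it runs.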

To be in charge, the VMM must be at a higher privilege level than the guest 
VM, which generally runs in user mode; this also ensures that the execution of any 
privileged instruction will be handled by the VMM. The basic requirements of 
system virtual machines are almost identical to those for paged virtual memory 
listed above: 


• At least two processor modes, system and user. 

• A privileged subset of instructions that is available only in system mode, 
resulting in a trap if executed in user mode. All system resources must be 
controllable only via these instructions. 


(Lack of) Instruction Set Architecture Support for 
Virtual Machines 


If VMs are planned for during the design of the ISA, it's relatively easy to both 
reduce the number of instructions that must be executed by a VMM and how long 
it takes to emulate them. An architecture that allows the VM to execute directly 
on the hardware earns the title virtualizable, and the IBM 370 architecture 
proudly bears that label. 

Alas, since VMs have been considered for desktop and PC-based server 
applications only fairly recently, most instruction sets were created without virtu- 
alization in mind. These culprits include 80x86 and most RISC architectures. 

Because the VMM must ensure that the guest system only interacts with vir- 
tual resources, a conventional guest OS runs as a user mode program on top of 
the VMM. Then, if a guest OS attempts to access or modify information related 
to hardware resources via a privileged instruction—for example, reading or writ- 
ing the page table pointer—it will trap to the VMM. The VMM can then effect 
the appropriate changes to corresponding real resources. 
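
A minimal sketch of this trap-and-emulate path, assuming a toy processor with a single privileged register, is shown here. The classes ToyCPU and TrapAndEmulateVMM, the exception PrivilegedTrap, and the real_table_for lookup are hypothetical names introduced for illustration only.

```python
class PrivilegedTrap(Exception):
    """Raised by the toy CPU when user-mode code executes a privileged instruction."""


class ToyCPU:
    def __init__(self):
        self.mode = "user"
        self.page_table_ptr = 0  # the real hardware register

    def write_page_table_ptr(self, value):
        if self.mode != "system":
            raise PrivilegedTrap(value)  # trap instead of touching real state
        self.page_table_ptr = value


class TrapAndEmulateVMM:
    def __init__(self, cpu, real_table_for):
        self.cpu = cpu
        self.real_table_for = real_table_for  # guest pointer -> real pointer (assumed)
        self.guest_view = 0                   # value the guest OS believes it wrote

    def run_guest_write(self, guest_value):
        try:
            self.cpu.write_page_table_ptr(guest_value)  # guest OS runs in user mode
        except PrivilegedTrap:
            self.guest_view = guest_value               # update the virtual resource
            self.cpu.mode = "system"                    # VMM acts with full privilege
            self.cpu.write_page_table_ptr(self.real_table_for[guest_value])
            self.cpu.mode = "user"                      # resume the guest


cpu = ToyCPU()
vmm = TrapAndEmulateVMM(cpu, real_table_for={0x1000: 0x9000})
vmm.run_guest_write(0x1000)
print(hex(vmm.guest_view), hex(cpu.page_table_ptr))
```

The guest sees the value it wrote, while the hardware register holds the value the VMM chose.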

Hence, if any instruction that tries to read or write such sensitive information 
traps when executed in user mode, the VMM can intercept it and support a virtual 
version of the sensitive information as the guest OS expects. 

In the absence of such support, other measures must be taken. A VMM must 
take special precautions to locate all problematic instructions and ensure that they 
behave correctly when executed by a guest OS, thereby increasing the complex- 
ity of the VMM and reducing the performance of running the VM. 

Sections 5.5 and 5.7 give concrete examples of problematic instructions in 
the 80x86 architecture. 


Impact of Virtual Machines on Virtual Memory and I/O 


Another challenge is virtualization of virtual memory, as each guest OS in every 
VM manages its own set of page tables. To make this work, the VMM separates 
the notions of real and physical memory (which are often treated synonymously), 
and makes real memory a separate, intermediate level between virtual memory 
and physical memory. (Some use the terms virtual memory, physical memory, 
and machine memory to name the same three levels.) The guest OS maps virtual 
memory to real memory via its page tables, and the VMM page tables map the 
guests' real memory to physical memory. The virtual memory architecture is 
specified either via page tables, as in IBM VM/370 and the 80x86, or via the TLB 
structure, as in many RISC architectures. 

Rather than pay an extra level of indirection on every memory access, the 
VMM maintains a shadow page table that maps directly from the guest virtual 
address space to the physical address space of the hardware. By detecting all mod- 
ifications to the guest's page table, the VMM can ensure the shadow page table 
entries being used by the hardware for translations correspond to those of the 
guest OS environment, with the exception of the correct physical pages substituted 
for the real pages in the guest tables. Hence, the VMM must trap any attempt by 
the guest OS to change its page table or to access the page table pointer. This is 
commonly done by write protecting the guest page tables and trapping any access 
to the page table pointer by a guest OS. As noted above, the latter happens natu- 
rally if accessing the page table pointer is a privileged operation. 
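
The relationship among the three levels can be made concrete with a small sketch. It assumes single-level page tables represented as Python dictionaries keyed by page number; the function names build_shadow and on_guest_pte_write are hypothetical, and a real VMM would also handle protection bits and invalid entries.

```python
def build_shadow(guest_pt, vmm_pt):
    """Compose virtual->real (guest table) with real->physical (VMM table)."""
    return {vpage: vmm_pt[rpage] for vpage, rpage in guest_pt.items()}


def on_guest_pte_write(guest_pt, vmm_pt, shadow, vpage, new_rpage):
    """Called by the (hypothetical) VMM after a write-protection trap."""
    guest_pt[vpage] = new_rpage          # the guest sees its own update
    shadow[vpage] = vmm_pt[new_rpage]    # hardware keeps using the right physical page


guest_pt = {0: 5, 1: 6}          # guest virtual page -> guest real page
vmm_pt = {5: 40, 6: 41, 7: 42}   # guest real page -> physical page
shadow = build_shadow(guest_pt, vmm_pt)
on_guest_pte_write(guest_pt, vmm_pt, shadow, vpage=1, new_rpage=7)
print(shadow)
```

The hardware walks only the shadow table, so a translation costs one lookup rather than two; the price is a trap on every guest page table update.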

The IBM 370 architecture solved the page table problem in the 1970s with an 
additional level of indirection that is managed by the VMM. The guest OS keeps 
its page tables as before, so the shadow pages are unnecessary. AMD has pro- 
posed a similar scheme for their Pacifica revision to the 80x86. 

To virtualize the TLB architected in many RISC computers, the VMM man- 
ages the real TLB and has a copy of the contents of the TLB of each guest VM. 
To pull this off, any instructions that access the TLB must trap. TLBs with Pro- 
cess ID tags can support a mix of entries from different VMs and the VMM, 
thereby avoiding flushing of the TLB on a VM switch. Meanwhile, in the back- 
ground, the VMM supports a mapping between the VMs' virtual Process IDs and 
the real Process IDs. 

The final portion of the architecture to virtualize is I/O. This is by far the most 
difficult part of system virtualization because of the increasing number of I/O 
devices attached to the computer and the increasing diversity of I/O device types. 
Another difficulty is the sharing of a real device among multiple VMs, and yet 
another comes from supporting the myriad of device drivers that are required, 
especially if different guest OSes are supported on the same VM system. The VM 
illusion can be maintained by giving each VM generic versions of each type of I/O 
device driver, and then leaving it to the VMM to handle real I/O. 

The method for mapping a virtual to physical I/O device depends on the type 
of device. For example, physical disks are normally partitioned by the VMM to 
create virtual disks for guest VMs, and the VMM maintains the mapping of vir- 
tual tracks and sectors to the physical ones. Network interfaces are often shared 
between VMs in very short time slices, and the job of the VMM is to keep track 
of messages for the virtual network addresses to ensure that guest VMs receive 
only messages intended for them. 


An Example VMM: The Xen Virtual Machine 


Early in the development of VMs, a number of inefficiencies became apparent. 
For example, a guest OS manages its virtual to real page mapping, but this map- 
ping is ignored by the VMM, which performs the actual mapping to physical 
pages. In other words, a significant amount of wasted effort is expended just to 
keep the guest OS happy. To reduce such inefficiencies, VMM developers 
decided that it may be worthwhile to allow the guest OS to be aware that it is run- 
ning on a VM. For example, a guest OS could assume a real memory as large as 
its virtual memory so that no memory management is required by the guest OS. 

Allowing small modifications to the guest OS to simplify virtualization is 
referred to as paravirtualization, and the open source Xen VMM is a good exam- 
ple. The Xen VMM provides a guest OS with a virtual machine abstraction that is 
similar to the physical hardware, but it drops many of the troublesome pieces. For 
example, to avoid flushing the TLB, Xen maps itself into the upper 64 MB of the 
address space of each VM. It allows the guest OS to allocate pages, just checking 
to be sure it does not violate protection restrictions. To protect the guest OS from 
the user programs in the VM, Xen takes advantage of the four protection levels 
available in the 80x86. The Xen VMM runs at the highest privilege level (0), the 
guest OS runs at the next level (1), and the applications run at the lowest privilege 
level (3). Most OSes for the 80x86 keep everything at privilege levels 0 or 3. 

For subsetting to work properly, Xen modifies the guest OS to not use prob- 
lematic portions of the architecture. For example, the port of Linux to Xen 
changed about 3000 lines, or about 1% of the 80x86-specific code. These 
changes, however, do not affect the application-binary interfaces of the guest OS. 

To simplify the I/O challenge of VMs, Xen recently assigned privileged vir- 
tual machines to each hardware I/O device. These special VMs are called driver 
domains. (Xen calls its VMs "domains.") Driver domains run the physical device 
drivers, although interrupts are still handled by the VMM before being sent to the 
appropriate driver domain. Regular VMs, called guest domains, run simple vir- 
tual device drivers that must communicate with the physical device drivers in the 
driver domains over a channel to access the physical I/O hardware. Data are sent 
between guest and driver domains by page remapping. 

Figure 5.15 compares the relative performance of Xen for six benchmarks. 
According to these experiments, Xen performs very close to the native perfor- 
mance of Linux. The popularity of Xen, plus such performance results, led stan- 
dard releases of the Linux kernel to incorporate Xen's paravirtualization changes. 

A subsequent study noticed that the experiments in Figure 5.15 were based on a 
single Ethernet network interface card (NIC), and the single NIC was a perfor- 
mance bottleneck. As a result, the higher processor utilization of Xen did not affect 
performance. Figure 5.16 compares TCP receive performance as the number of 
NICs increases from 1 to 4 for native Linux and two configurations of Xen: 


1. Xen privileged VM only (driver domain). To measure the overhead of Xen 
without the driver VM scheme, the whole application is run inside the single 
privileged driver domain. 


2. Xen guest VM + privileged VM. In the more natural setting, the application 
and virtual device driver run in the guest VM (guest domain), and the physi- 
cal device driver runs in the privileged driver VM (driver domain). 


Clearly, a single NIC is a bottleneck. Xen driver VM peaks at 1.9 Gbits/sec with 
2 NICs while native Linux peaks at 2.5 Gbits/sec with 3 NICs. For guest VMs, 
the peak receive rate drops under 0.9 Gbits/sec. 

After removing the NIC bottleneck, a different Web server workload showed 
that driver VM Xen achieves less than 80% of the throughput of native Linux, 
while guest VM + driver VM drops to 34%. 

Figure 5.15 Relative performance for Xen versus native Linux. The experiments were 
performed on a Dell 2650 dual processor 2.4 GHz Xeon server with 2 GB RAM, one 
Broadcom Tigon 3 Gigabit Ethernet NIC, a single Hitachi DK32EJ 146 GB 10K RPM SCSI 
disk, and running Linux version 2.4.21 [Barham et al. 2003; Clark et al. 2004]. 

Figure 5.16 TCP receive performance in Mbits/sec for native Linux versus two 
configurations of Xen. Guest VM + driver VM is the conventional configuration [Menon et 
al. 2005]. The experiments were performed on a Dell PowerEdge 1600SC running a 2.4 
GHz Xeon server with 1 GB RAM, and four Intel Pro-1000 Gigabit Ethernet NICs, running 
Linux version 2.6.10 and Xen version 2.0.3. 

Figure 5.17 Relative change in instructions executed, L2 cache misses, and I-TLB 
and D-TLB misses of native Linux versus two configurations of Xen for a Web 
workload [Menon et al. 2005]. Higher L2 and TLB misses come from the lack of support in 
Xen for superpages, globally marked PTEs, and gather DMA [Menon 2006]. 


Figure 5.17 explains this drop in performance by plotting the relative change 
in instructions executed, L2 cache misses, and instruction and data TLB misses 
for native Linux and the two Xen configurations. Data TLB misses per instruc- 
tion are 12-24 times higher for Xen than for native Linux, and this is the primary 
reason for the slowdown for the privileged driver VM configuration. The higher 
TLB misses are because of two optimizations that Linux uses that Xen does not: 
superpages and marking page table entries as global. Linux uses superpages for 
part of its kernel space, and using 4 MB pages obviously lowers TLB misses ver- 
sus using 1024 4 KB pages. PTEs marked global are not flushed on a context 
switch, and Linux uses them for its kernel space. 

In addition to higher D-TLB misses, the more natural guest VM + driver VM 
configuration executes more than twice as many instructions. The increase is due 
to page remapping and page transfer between the driver and guest VMs and due 
to communication between the two VMs over a channel. This is also the reason 
for the lower receive performance of guest VMs in Figure 5.16. In addition, the 
guest VM configuration has more than four times as many L2 cache misses. The 
reason is Linux uses a zero-copy network interface that depends on the ability of 
the NIC to do DMA from different locations in memory. Since Xen does not sup- 
port "gather DMA" in its virtual network interface, it can't do true zero-copy in 
the guest VM, resulting in more L2 cache misses. 

While future versions of Xen may be able to incorporate support for super- 
pages, globally marked PTEs, and gather DMA, the higher instruction overhead 
looks to be inherent in the split between guest VM and driver VM. 


Crosscutting Issues: The Design of Memory Hierarchies 


This section describes three topics discussed in other chapters that are fundamen- 
tal to memory hierarchies. 


Protection and Instruction Set Architecture 


Protection is a joint effort of architecture and operating systems, but architects 
had to modify some awkward details of existing instruction set architectures 
when virtual memory became popular. For example, to support virtual memory in 
the IBM 370, architects had to change the successful IBM 360 instruction set 
architecture that had been announced just six years before. Similar adjustments 
are being made today to accommodate virtual machines. 

For example, the 80x86 instruction POPF loads the flag registers from the top 
of the stack in memory. One of the flags is the Interrupt Enable (IE) flag. If you 
run the POPF instruction in user mode, rather than trap it simply changes all the 
flags except IE. In system mode, it does change the IE. Since a guest OS runs in 
user mode inside a VM, this is a problem, as it expects to see a changed IE. 
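
The problem is that the instruction fails silently instead of trapping, which a short model makes visible. The sketch below is a simplification under stated assumptions: the function popf is hypothetical, only two flags are modeled, and the role of the 80x86 I/O privilege level is ignored. Bit 9 is the position of the interrupt-enable flag in the 80x86 flags register.

```python
IE = 1 << 9   # interrupt-enable flag (bit 9 of the 80x86 flags register)
CF = 1 << 0   # carry flag, used here as an ordinary non-privileged flag


def popf(current_flags, popped_value, mode):
    """Toy POPF: returns the new flags value. No trap is ever raised."""
    if mode == "system":
        return popped_value
    # User mode: every flag is updated except IE, which silently keeps its old value.
    return (popped_value & ~IE) | (current_flags & IE)


guest_os_wants = IE | CF                       # guest OS tries to enable interrupts
in_user = popf(0, guest_os_wants, "user")      # what happens inside a VM
in_system = popf(0, guest_os_wants, "system")  # what the guest OS expects
print(bool(in_user & IE), bool(in_system & IE))
```

Because no trap occurs in the user-mode case, the VMM gets no chance to record the guest's intended IE value.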

Historically, IBM mainframe hardware and VMM took three steps to improve 
performance of virtual machines: 


1. Reduce the cost of processor virtualization 
2. Reduce interrupt overhead cost due to the virtualization 
3. Reduce interrupt cost by steering interrupts to the proper VM without invoking 
the VMM 


IBM is still the gold standard of virtual machine technology. For example, an 
IBM mainframe ran thousands of Linux VMs in 2000, while Xen ran 25 VMs in 
2004 [Clark et al. 2004]. 

In 2006, new proposals by AMD and Intel try to address the first point, reducing 
the cost of processor virtualization (see Section 5.7). It will be interesting 
to see how many generations of architecture and VMM modifications it will take to 
address all three points, and how long before virtual machines of the 21st century 
will be as efficient as the IBM mainframes and VMMs of the 1970s. 


Speculative Execution and the Memory System 


Inherent in processors that support speculative execution or conditional instruc- 
tions is the possibility of generating invalid addresses that would not occur with- 
out speculative execution. Not only would this be incorrect behavior if protection 
exceptions were taken, but the benefits of speculative execution would be 
swamped by false exception overhead. Hence, the memory system must identify 
speculatively executed instructions and conditionally executed instructions and 
suppress the corresponding exception. 

By similar reasoning, we cannot allow such instructions to cause the cache to 
stall on a miss because again unnecessary stalls could overwhelm the benefits of 
speculation. Hence, these processors must be matched with nonblocking caches. 

In reality, the penalty of an L2 miss is so large that compilers normally only 
speculate on L1 misses. Figure 5.5 on page 297 shows that for some well- 
behaved scientific programs the compiler can sustain multiple outstanding L2 
misses to cut the L2 miss penalty effectively. Once again, for this to work, the 
memory system behind the cache must match the goals of the compiler in num- 
ber of simultaneous memory accesses. 


I/O and Consistency of Cached Data 


Data can be found in memory and in the cache. As long as one processor is the 
sole device changing or reading the data and the cache stands between the pro- 
cessor and memory, there is little danger in the processor seeing the old or stale 
copy. As mentioned in Chapter 4, multiple processors and I/O devices raise the 
opportunity for copies to be inconsistent and to read the wrong copy. 

The frequency of the cache coherency problem is different for multiproces- 
sors than I/O. Multiple data copies are a rare event for I/O—one to be avoided 
whenever possible—but a program running on multiple processors will want to 
have copies of the same data in several caches. Performance of a multiprocessor 
program depends on the performance of the system when sharing data. 

The I/O cache coherency question is this: Where does the I/O occur in the 
computer—between the I/O device and the cache or between the I/O device and 
main memory? If input puts data into the cache and output reads data from the 
cache, both I/O and the processor see the same data. The difficulty in this 
approach is that it interferes with the processor and can cause the processor to 
stall for I/O. Input may also interfere with the cache by displacing some 
information with new data that is unlikely to be accessed soon. 

The goal for the I/O system in a computer with a cache is to prevent the 
stale-data problem while interfering as little as possible. Many systems, therefore, 
prefer that I/O occur directly to main memory, with main memory acting as an I/O 
buffer. If a write-through cache were used, then memory would have an up-to- 
date copy of the information, and there would be no stale-data issue for output. 
(This benefit is a reason processors used write through.) Alas, write through is 
usually found today only in first-level data caches backed by an L2 cache that 
uses write back. 

Input requires some extra work. The software solution is to guarantee that no 
blocks of the input buffer are in the cache. A page containing the buffer can be 
marked as noncachable, and the operating system can always input to such a 
page. Alternatively, the operating system can flush the buffer addresses from the 
cache before the input occurs. A hardware solution is to check the I/O addresses 
on input to see if they are in the cache. If there is a match of I/O addresses in the 
cache, the cache entries are invalidated to avoid stale data. All these approaches 
can also be used for output with write-back caches. 
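
The stale-data problem on input, and the hardware fix of invalidating matching cache entries, can be illustrated with a toy model. Everything here is an assumption made for illustration: the class ToyMemorySystem, its one-word blocks, and the invalidate switch that turns the address check on or off.

```python
class ToyMemorySystem:
    def __init__(self):
        self.memory = {}  # address -> value
        self.cache = {}   # address -> cached copy

    def cpu_read(self, addr):
        if addr not in self.cache:                 # miss: fill from memory
            self.cache[addr] = self.memory.get(addr, 0)
        return self.cache[addr]

    def io_input(self, addr, value, invalidate=True):
        self.memory[addr] = value                  # device writes main memory directly
        if invalidate:
            self.cache.pop(addr, None)             # drop any matching cached copy


system = ToyMemorySystem()
system.memory[0x100] = 1
first = system.cpu_read(0x100)                 # brings the word into the cache
system.io_input(0x100, 2, invalidate=False)    # no address check on input
stale = system.cpu_read(0x100)                 # processor still sees the old copy
system.io_input(0x100, 3)                      # address check invalidates the entry
fresh = system.cpu_read(0x100)
print(first, stale, fresh)
```

Marking the buffer page noncachable or flushing it before input has the same effect in this model as never letting the address enter the cache dictionary.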


Putting It All Together: AMD Opteron Memory Hierarchy 


This section unveils the AMD Opteron memory hierarchy and shows the perfor- 
mance of its components for the SPEC2000 programs. The Opteron is an out-of- 
order execution processor that fetches up to three 80x86 instructions per clock 
cycle, translates them into RISC-like operations, issues three of them per clock 
cycle, and it has 11 parallel execution units. In 2006, the 12-stage integer pipeline 
yields a maximum clock rate of 2.8 GHz, and the fastest memory supported is 
PC3200 DDR SDRAM. It uses 48-bit virtual addresses and 40-bit physical 
addresses. Figure 5.18 shows the mapping of the address through the multiple 
levels of data caches and TLBs, similar to the format of Figure 5.3 on page 292. 

We are now ready to follow the memory hierarchy in action: Figure 5.19 is 
labeled with the steps of this narrative. First, the PC is sent to the instruction 
cache. It is 64 KB, two-way set associative with a 64-byte block size and LRU 
replacement. The cache index is 


$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{64\text{K}}{64 \times 2} = 512 = 2^{9}$$





or 9 bits. It is virtually indexed and physically tagged. Thus, the page frame of the 
instruction's data address is sent to the instruction TLB (step 1) at the same time 
the 9-bit index (plus an additional 2 bits to select the appropriate 16 bytes) from 
the virtual address is sent to the data cache (step 2). The fully associative TLB 
simultaneously searches all 40 entries to find a match between the address and a 
valid PTE (steps 3 and 4). In addition to translating the address, the TLB checks 
to see if the PTE demands that this access result in an exception due to an access 
violation. 

Figure 5.18 The virtual address, physical address, indexes, tags, and data blocks for the AMD Opteron caches 
and TLBs. Since the instruction and data hierarchies are symmetric, we only show one. The L1 TLB is fully 
associative with 40 entries. The L2 TLB is 4-way set associative with 512 entries. The L1 cache is 2-way set associative with 
64-byte blocks and 64 KB capacity. The L2 cache is 16-way set associative with 64-byte blocks and 1 MB capacity. This 
figure doesn't show the valid bits and protection bits for the caches and TLBs, as does Figure 5.19. 




An I TLB miss first goes to the L2 I TLB, which contains 512 PTEs of 4 KB 
page sizes and is four-way set associative. It takes 2 clock cycles to load the L1 
TLB from the L2 TLB. The traditional 80x86 TLB scheme flushes all TLBs if the 
page directory pointer register is changed. In contrast, Opteron checks for 
changes to the actual page directory in memory and flushes only when the data 
structure is changed, thereby avoiding some flushes. 

In the worst case, the page is not in memory, and the operating system gets 
the page from disk. Since millions of instructions could execute during a page 
fault, the operating system will swap in another process if one is waiting to run. 
Otherwise, if there is no TLB exception, the instruction cache access continues. 

Figure 5.19 The AMD Opteron memory hierarchy. The L1 caches are both 64 KB, 2-way set associative with 64-byte 
blocks and LRU replacement. The L2 cache is 1 MB, 16-way set associative with 64-byte blocks, and pseudo LRU 
replacement. The data and L2 caches use write back with write allocation. The L1 instruction and data caches are 
virtually indexed and physically tagged, so every address must be sent to the instruction or data TLB at the same time 
as it is sent to a cache. Both TLBs are fully associative and have 40 entries, with 32 entries for 4 KB pages and 8 for 2 
MB or 4 MB pages. Each TLB has a 4-way set associative L2 TLB behind it, with 512 entries of 4 KB page sizes. Opteron 
supports 48-bit virtual addresses and 40-bit physical addresses. 

The index field of the address is sent to both groups of the two-way set- 
associative data cache (step 5). The instruction cache tag is 40 - 9 bits (index) - 
6 bits (block offset) or 25 bits. The four tags and valid bits are compared to the 
physical page frame from the Instruction TLB (step 6). As the Opteron expects 
16 bytes each instruction fetch, an additional 2 bits are used from the 6-bit block 
offset to select the appropriate 16 bytes. Hence, 9 + 2 or 11 bits are used to send 
16 bytes of instructions to the processor. The L1 cache is pipelined, and the 
latency of a hit is 2 clock cycles. A miss goes to the second-level cache and to the 
memory controller, to lower the miss penalty in case the L2 cache misses. 
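
The field widths quoted for the L1 instruction cache follow directly from its parameters, as this short check shows. The helper cache_fields is a hypothetical name; it assumes every size is a power of two.

```python
def cache_fields(cache_bytes, block_bytes, ways, phys_addr_bits):
    """Return (index_bits, offset_bits, tag_bits) for a set-associative cache."""
    sets = cache_bytes // (block_bytes * ways)
    index_bits = sets.bit_length() - 1          # log2 of a power of two
    offset_bits = block_bytes.bit_length() - 1
    tag_bits = phys_addr_bits - index_bits - offset_bits
    return index_bits, offset_bits, tag_bits


# 64 KB, 64-byte blocks, two-way set associative, 40-bit physical address
index_bits, offset_bits, tag_bits = cache_fields(64 * 1024, 64, 2, 40)
fetch_select_bits = (64 // 16).bit_length() - 1   # which 16-byte chunk of the block
print(index_bits, offset_bits, tag_bits, index_bits + fetch_select_bits)
```

The last value printed is the 9 + 2 = 11 bits used to pick the 16 bytes delivered to the processor.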

As mentioned earlier, the instruction cache is virtually addressed and physi- 
cally tagged. On a miss, the cache controller must check for a synonym (two dif- 
ferent virtual addresses that reference the same physical address). Hence, the 
instruction cache tags are examined for synonyms in parallel with the L2 cache 
tags during an L2 lookup. As the minimum page size is 4 KB or 12 bits and the 
cache index plus block offset is 15 bits, the cache must check 2^3 or 8 blocks per 
way for synonyms. Opteron uses the redundant snooping tags to check all syn- 
onyms in 1 clock cycle. If it finds a synonym, the offending block is invalidated 
and refetched from memory. This guarantees that a cache block can reside in only 
one of the 16 possible data cache locations at any given time. 
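
The count of synonym locations can be reproduced with a few lines of arithmetic. In this sketch the function candidate_sets and the example address 0x12345 are made up for illustration; the bit widths are the ones given in the text.

```python
INDEX_BITS, OFFSET_BITS, PAGE_BITS, WAYS = 9, 6, 12, 2

# Index bits that lie above the page offset come from the virtual page number,
# so they can differ between two virtual addresses for the same physical block.
extra_bits = INDEX_BITS + OFFSET_BITS - PAGE_BITS
sets_per_way = 1 << extra_bits


def candidate_sets(physical_addr):
    """Set indexes where a block with this physical address could already sit."""
    low_index_bits = PAGE_BITS - OFFSET_BITS
    fixed = (physical_addr >> OFFSET_BITS) & ((1 << low_index_bits) - 1)
    return [(high << low_index_bits) | fixed for high in range(sets_per_way)]


sets = candidate_sets(0x12345)
print(sets_per_way, sets_per_way * WAYS, sets)
```

Eight candidate sets in each of two ways give the 16 locations that the snooping tags must examine.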

The second-level cache tries to fetch the block on a miss. The L2 cache is 
1 MB, 16-way set associative with 64-byte blocks. It uses a pseudo-LRU scheme 
by managing eight pairs of blocks LRU, and then randomly picking one of the 
LRU pair on a replacement. The L2 index is 


$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{1024\text{K}}{64 \times 16} = 1024 = 2^{10}$$





so the 34-bit block address (40-bit physical address - 6-bit block offset) is 
divided into a 24-bit tag and a 10-bit index (step 8). Once again, the index and tag 
are sent to all 16 groups of the 16-way set associative data cache (step 9), which 
are compared in parallel. If one matches and is valid (step 10), it returns the block 
in sequential order, 8 bytes per clock cycle. The L2 cache also cancels the 
memory request that the L1 cache sent to the controller. An L1 instruction cache miss 
that hits in the L2 cache costs 7 processor clock cycles for the first word. 
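
As a check on the L2 field widths, the following sketch splits a 40-bit physical address into tag, index, and block offset. The function split_l2_address and the example address are hypothetical; only the widths come from the text.

```python
OFFSET_BITS, INDEX_BITS, TAG_BITS = 6, 10, 24   # 6 + 10 + 24 = 40-bit physical address


def split_l2_address(paddr):
    """Return (tag, index, offset) for the 1 MB, 16-way, 64-byte-block L2 cache."""
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


sets = (1024 * 1024) // (64 * 16)                # cache size / (block size x associativity)
paddr = (0xABCDEF << 16) | (0x155 << 6) | 0x2A   # made-up address for illustration
tag, index, offset = split_l2_address(paddr)
print(sets, hex(tag), hex(index), hex(offset))
```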

The Opteron has an exclusion policy between the L1 caches and the L2 cache 
to try to better utilize the resources, which means a block is in L1 or L2 caches 
but not in both. Hence, it does not simply place a copy of the block in the L2 
cache. Instead, the only copy of the new block is placed in the L1 cache. The old 
L1 block is sent to the L2 cache. If a block knocked out of the L2 cache is dirty, it 
is sent to the write buffer, called the victim buffer in the Opteron. 
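
The movement of blocks under exclusion can be traced with a toy model. This sketch is far simpler than the Opteron: it assumes fully associative caches with true LRU replacement and capacities counted in blocks, and the class ExclusiveHierarchy and its method names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    def __init__(self, l1_blocks, l2_blocks):
        self.l1 = OrderedDict()   # block -> dirty flag, oldest entry first
        self.l2 = OrderedDict()
        self.l1_cap, self.l2_cap = l1_blocks, l2_blocks
        self.victim_buffer = []   # dirty blocks waiting to be written to memory

    def access(self, block, write=False):
        if block in self.l1:
            self.l1.move_to_end(block)
            self.l1[block] = self.l1[block] or write
            return "L1 hit"
        if block in self.l2:
            dirty, where = self.l2.pop(block), "L2 hit"   # the block leaves L2
        else:
            dirty, where = False, "memory"
        self._fill_l1(block, dirty or write)               # only copy goes to L1
        return where

    def _fill_l1(self, block, dirty):
        if len(self.l1) >= self.l1_cap:
            old, old_dirty = self.l1.popitem(last=False)
            self._fill_l2(old, old_dirty)                  # old L1 block moves to L2
        self.l1[block] = dirty

    def _fill_l2(self, block, dirty):
        if len(self.l2) >= self.l2_cap:
            victim, victim_dirty = self.l2.popitem(last=False)
            if victim_dirty:
                self.victim_buffer.append(victim)          # dirty L2 victim
        self.l2[block] = dirty


h = ExclusiveHierarchy(l1_blocks=1, l2_blocks=1)
results = [h.access("A", write=True), h.access("B"), h.access("C"), h.access("B")]
print(results, list(h.l1), list(h.l2), h.victim_buffer)
```

At every step a block appears in at most one of the two levels, so the combined capacity is the sum of the two caches rather than the size of the larger one.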

In the last chapter, we showed how inclusion allows all coherency traffic to 
affect only the L2 cache and not the L1 caches. Exclusion means coherency 
traffic must check both. To reduce interference between coherency traffic and the 
processor for the L1 caches, the Opteron has a duplicate set of address tags for 
coherency snooping. 

If the instruction is not found in the secondary cache, the on-chip memory 
controller must get the block from main memory. The Opteron has dual 64-bit 
memory channels that can act as one 128-bit channel, since there is only one 
memory controller and the same address is sent on both channels (step 11). Wide 
transfers happen when both channels have identical DIMMs. Each channel 
supports up to four DDR DIMMs (step 12). 

Since the Opteron provides single-error correction/double-error detection 
checking on data cache, L2 cache, buses, and main memory, the data buses actu- 
ally include an additional 8 bits for ECC for every 64 bits of data. To reduce the 
chances of a second error, the Opteron uses idle cycles to remove single-bit errors 
by reading and rewriting damaged blocks in the data cache, L2 cache, and mem- 
ory. Since the instruction cache and TLBs are read-only structures, they are pro- 
tected by parity, and reread from lower levels if a parity error occurs. 

The total latency of the instruction miss that is serviced by main memory is 
approximately 20 processor cycles plus the DRAM latency for the critical 
instructions. For a PC3200 DDR SDRAM and 2.8 GHz CPU, the DRAM latency 
is 140 processor cycles (50 ns) to the first 16 bytes. The memory controller fills 
the remainder of the 64-byte cache block at a rate of 16 bytes per memory clock 
cycle. With 200 MHz DDR DRAM, that is three more clock edges and an extra 
7.5 ns latency, or 21 more processor cycles with a 2.8 GHz processor (step 13). 
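
The cycle counts in this paragraph are simple unit conversions, reproduced here as a check. The constant names are our own; the one assumption beyond the text is that each of the three remaining transfers takes one clock edge of 2.5 ns, which is half the 5 ns period of a 200 MHz clock.

```python
CPU_GHZ = 2.8                 # processor cycles per nanosecond
DRAM_LATENCY_NS = 50.0        # latency to the first 16 bytes
EDGE_NS = 2.5                 # 200 MHz DDR: 5 ns clock period, data on both edges
BLOCK_BYTES, CHUNK_BYTES = 64, 16
FIXED_OVERHEAD_CYCLES = 20    # "approximately 20 processor cycles" in the text

dram_cycles = round(DRAM_LATENCY_NS * CPU_GHZ)
extra_transfers = BLOCK_BYTES // CHUNK_BYTES - 1    # chunks after the first one
extra_ns = extra_transfers * EDGE_NS
extra_cycles = round(extra_ns * CPU_GHZ)
first_chunk_cycles = FIXED_OVERHEAD_CYCLES + dram_cycles
print(dram_cycles, extra_ns, extra_cycles, first_chunk_cycles)
```

The last figure, roughly 160 processor cycles, is the approximate latency until the critical instructions arrive; the rest of the block follows about 21 cycles later.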

Opteron has a prefetch engine associated with the L2 cache (step 14). It looks 
at patterns for L2 misses to consecutive blocks, either ascending or descending, 
and then prefetches the next line into the L2 cache. 

Since the second-level cache is a write-back cache, any miss can lead to an 
old block being written back to memory. The Opteron places this "victim" block 
into a victim buffer (step 15), as it does with a victim dirty block in the data 
cache. The buffer allows the original instruction fetch read that missed to proceed 
first. The Opteron sends the address of the victim out the system address bus fol- 
lowing the address of the new request. The system chip set later extracts the vic- 
tim data and writes it to the memory DIMMs. 

The victim buffer is size eight, so many victims can be queued before being 
written back either to L2 or to memory. The memory controller can manage up to 
10 simultaneous cache block misses—8 from the data cache and 2 from the 
instruction cache—allowing it to hit under 10 misses, as described in Appendix 
C. The data cache and L2 cache check the victim buffer for the missing block, but 
it stalls until the data is written to memory and then refetched. The new data are 
loaded into the instruction cache as soon as they arrive (step 16). Once again, 
because of the exclusion property, the missing block is not loaded into the 
L2 cache. 

If this initial instruction is a load, the data address is sent to the data cache 
and data TLBs, acting very much like an instruction cache access since the 
instruction and data caches and TLBs are symmetric. One difference is that the 
data cache has two banks so that it can support two loads or stores simulta- 
neously, as long as they address different banks. In addition, a write-back victim 
can be produced on a data cache miss. The victim data are extracted from the data 
cache simultaneously with the fill of the data cache with the L2 data and sent to 
the victim buffer. 

Suppose the instruction is a store instead of a load. When the store issues, it 
does a data cache lookup just like a load. A store miss causes the block to be 
filled into the data cache very much as with a load miss, since the policy is to
allocate on writes. The store does not update the cache until later, after it is 
known to be nonspeculative. During this time the store resides in a load-store 
queue, part of the out-of-order control mechanism of the processor. It can hold up 
to 44 entries and supports speculative forwarding of results to the execution unit.
The data cache is ECC protected, so a read-modify-write operation is required to 
update the data cache on stores. This is accomplished by assembling the full 
block in the load/store queue and always writing the entire block. 


Performance of the Opteron Memory Hierarchy 


How well does the Opteron work? The bottom line in this evaluation is the per- 
centage of time lost while the processor is waiting for the memory hierarchy. The 
major components are the instruction and data caches, instruction and data TLBs, 
and the secondary cache. Alas, in an out-of-order execution processor like the 
Opteron, it is very hard to isolate the time waiting for memory, since a memory 
stall for one instruction may be completely hidden by successful completion of a 
later instruction. 

Figure 5.20 shows the CPI and various misses per 1000 instructions for a 
benchmark similar to TPC-C on a database and the SPEC2000 programs. Clearly, 
most of the SPEC2000 programs do not tax the Opteron memory hierarchy, with 
mcf being the exception. (SPEC nicknamed it the "cache buster" because of its 
memory footprint size and its access patterns.) The average SPEC I cache misses
per instruction are 0.01% to 0.09%, the average D cache misses per instruction are
1.34% to 1.43%, and the average L2 cache misses per instruction are 0.23% to
0.36%. The commercial benchmark does exercise the memory hierarchy more,
with misses per instruction of 1.83%, 1.39%, and 0.62%, respectively.

How do the real CPIs of Opteron compare to the peak CPI of 0.33, or 3
instructions per clock cycle? The Opteron completes on average 0.8-0.9 instructions
per clock cycle for SPEC2000, with an average CPI of 1.15-1.30. For the
database benchmark, the higher miss rates for caches and TLBs yield a CPI of
2.57, or 0.4 instructions per clock cycle. This factor of 2 slowdown in CPI for
TPC-C-like benchmarks suggests that microprocessors designed for servers see
heavier demands on the memory systems than do microprocessors for desktops.
Figure 5.21 estimates the breakdown between the base CPI of 0.33 and the stalls
for memory and for the pipeline.
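
CPI and instructions per clock are reciprocals, so the figures quoted above can be cross-checked directly. The snippet below is a small illustration; the function name and dictionary are ours, and the CPI values are the ones given in the text.

```python
def instructions_per_clock(cpi):
    """Instructions per clock cycle is the reciprocal of CPI."""
    return 1.0 / cpi


# Average CPIs quoted in the text.
measured_cpi = {
    "SPECfp2000": 1.15,
    "SPECint2000": 1.30,
    "TPC-C-like": 2.57,
}

for name, cpi in measured_cpi.items():
    print(f"{name:12s} CPI {cpi:.2f} -> IPC {instructions_per_clock(cpi):.2f}")

# Slowdown of the database workload relative to SPECint2000.
slowdown = measured_cpi["TPC-C-like"] / measured_cpi["SPECint2000"]
print(f"TPC-C-like vs. SPECint2000 slowdown: {slowdown:.2f}x")
```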

Figure 5.21 assumes none of the memory hierarchy misses are overlapped 
with the execution pipeline or with each other, so the pipeline stall portion is a 
lower bound. Using this calculation, the CPI above the base that is attributable to 
memory averages about 50% for the integer programs (from 1% for eon to 100% 
for vpr) and about 60% for the floating-point programs (from 12% for sixtrack to 
98% for applu). Going deeper into the numbers, about 50% of the memory CPI 
(25% overall) is due to L2 cache misses for the integer programs and L2 repre- 
sents about 70% of the memory CPI for the floating-point programs (40% over- 
all). As mentioned earlier, L2 misses are so long that it is difficult to hide them 
with extra work. 
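
The breakdown in Figure 5.21 multiplies misses per instruction by miss penalties and attributes the remainder above the base CPI to pipeline stalls. The sketch below follows that recipe under stated assumptions: the function and parameter names are ours, TLB misses are ignored, the L2 miss penalty of 140 cycles is charged in addition to the L1 miss penalty, and the L1 miss penalty of 10 cycles is a hypothetical placeholder rather than a value taken from the text.

```python
def cpi_breakdown(measured_cpi, l1_misses_per_1000, l2_misses_per_1000,
                  l1_miss_penalty, l2_miss_penalty=140, base_cpi=0.33):
    """Split a measured CPI into base, memory, and pipeline-stall parts.

    Assumes no overlap between misses and execution, so the memory
    portion is an upper estimate and the pipeline portion a lower one.
    """
    memory_cpi = (l1_misses_per_1000 * l1_miss_penalty
                  + l2_misses_per_1000 * l2_miss_penalty) / 1000.0
    pipeline_cpi = measured_cpi - base_cpi - memory_cpi
    return base_cpi, memory_cpi, pipeline_cpi


# 164.gzip row of Figure 5.20: CPI 0.86, I cache 0.01, D cache 16.03, L2 0.10.
HYPOTHETICAL_L1_MISS_PENALTY = 10  # cycles; an assumption for illustration
base, memory, pipeline = cpi_breakdown(
    measured_cpi=0.86,
    l1_misses_per_1000=0.01 + 16.03,
    l2_misses_per_1000=0.10,
    l1_miss_penalty=HYPOTHETICAL_L1_MISS_PENALTY,
)
print(f"base {base:.2f}, memory {memory:.3f}, pipeline {pipeline:.3f}")
```

Changing the assumed L1 miss penalty shifts CPI between the memory and pipeline portions, which is one reason the figure should be read as an estimate.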
Benchmark           Avg CPI  I cache  D cache       L2  ITLB L1  DTLB L1  ITLB L2  DTLB L2
TPC-C-like             2.57    18.34    13.89     6.18     3.25     9.00     0.09     1.71
SPECint2000 total      1.30     0.90    14.27     3.57     0.25    12.47     0.00     1.06
164.gzip               0.86     0.01    16.03     0.10     0.01    11.06     0.00     0.09
175.vpr                1.78     0.02    23.36     5.73     0.01    50.52     0.00     3.22
176.gcc                1.02     1.94    19.04     0.90     0.79     4.53     0.00     0.19
181.mcf               13.06     0.02   148.90   103.82     0.01    50.49     0.00    26.98
186.crafty             0.72     3.15     4.05     0.06     0.16    18.07     0.00     0.01
197.parser             1.28     0.08    14.80     1.34     0.01    11.56     0.00     0.65
252.eon                0.82     0.06     0.45     0.00     0.01     0.05     0.00     0.00
253.perlbmk            0.70     1.36     2.41     0.43     0.93     3.51     0.00     0.31
254.gap                0.86     0.76     4.27     0.58     0.05     3.38     0.00     0.33
255.vortex             0.88     3.67     5.86     1.17     0.68    15.78     0.00     1.38
256.bzip2              1.00     0.01    10.57     2.94     0.00     8.17     0.00     0.63
300.twolf              1.85     0.08    26.18     4.49     0.02    14.79     0.00     0.01
SPECfp2000 total       1.15     0.08    13.43     2.26     0.01     3.70     0.00     0.79
168.wupwise            0.83     0.00     6.56     1.66     0.00     0.22     0.00     0.17
171.swim               1.88     0.01    30.87     2.02     0.00     0.59     0.00     0.41
172.mgrid              0.89     0.01    16.54     1.35     0.00     0.35     0.00     0.25
173.applu              0.97     0.01     8.48     3.41     0.00     2.42     0.00     0.13
177.mesa               0.78     0.03     1.58     0.13     0.01     8.78     0.00     0.17
178.galgel             1.07     0.01    18.63     2.38     0.00     7.62     0.00     0.67
179.art                3.03     0.00    56.96     8.27     0.00     1.20     0.00     0.41
183.equake             2.35     0.06    37.29     3.30     0.00     1.20     0.00     0.59
187.facerec            1.07     0.01     9.31     3.94     0.00     1.21     0.00     0.20
188.ammp               1.19     0.02    16.58     2.37     0.00     8.61     0.00     3.25
189.lucas              1.73     0.00    17.35     4.36     0.00     4.80     0.00     3.27
191.fma3d              1.34     0.20    11.84     3.02     0.05     0.36     0.00     0.21
200.sixtrack           0.63     0.03     0.53     0.16     0.01     0.66     0.00     0.01
301.apsi               1.17     0.50    13.81     2.48     0.01    10.37     0.00     1.69

Figure 5.20 CPI and misses per 1000 instructions for running a TPC-C-like database workload and the SPEC2000
benchmarks on the AMD Opteron. Since the Opteron uses an out-of-order instruction execution, the statistics are
calculated as the number of misses per 1000 instructions successfully committed.

Figure 5.21 Area plots that estimate CPI breakdown into base CPI, memory stalls,
and pipeline stalls for SPECint2000 programs (plus a TPC-C-like benchmark) on the
top and SPECfp2000 on the bottom. They are sorted from lowest to highest overall
CPI. We estimated the memory CPI by multiplying the misses per instruction at the
various levels by their miss penalties, and subtracted it and the base CPI from the measured
CPI to calculate the pipeline stall CPI. The L2 miss penalty is 140 clock cycles, and all
other misses hit in the L2 cache. This estimate assumes no overlapping of memory and
execution, so the memory portion is high, as some of it is surely overlapped with
pipeline stalls and with other memory accesses. Since it would overwhelm the rest of the
data with its CPI of 13.06, mcf is not included. Memory misses must be overlapped in
mcf; otherwise the CPI would grow to 18.53.


Figure 5.22 Ratio of misses per instruction for Pentium 4 versus Opteron. Bigger
means a higher miss rate for Pentium 4. The 10 programs are the first 5 SPECint2000
and the first 5 SPECfp2000. (The two processors and their memory hierarchies are
described in the table in the text.) The geometric mean of the ratio of performance of
the 5 SPECint programs on the two processors is 1.00 with a standard deviation of 1.42;
the geometric mean of the performance of the 5 SPECfp programs suggests Opteron is
1.15 times faster, with a standard deviation of 1.25. Note the clock rate for the Pentium
4 was 3.2 GHz in these experiments; higher-clock-rate Pentium 4s were available but
not used in this experiment. Figure 5.10 shows that half of these programs benefit
significantly from the prefetching hardware of the Pentium 4: mcf, wupwise, swim, mgrid,
and applu.


Finally, Figure 5.22 compares the miss rates of the data caches and the L2
caches of Opteron to the Intel Pentium 4, showing the ratio of the misses per
instruction for 10 SPEC2000 benchmarks. Although they are executing the same
programs compiled for the same instruction set, the compilers and resulting code
sequences are different as are the memory hierarchies. The following table
summarizes the two memory hierarchies:














Processor     Pentium 4 (3.2 GHz)               Opteron (2.8 GHz)

Data cache    8-way associative, 16 KB,         2-way associative, 64 KB,
              64-byte block                     64-byte block

L2 cache      8-way associative, 2 MB,          16-way associative, 1 MB,
              128-byte block, inclusive of      64-byte block, exclusive of
              D cache                           D cache

Prefetch      8 streams to L2                   1 stream to L2





Although the Pentium 4 has much higher associativity, the four times larger
data cache of Opteron has lower L1 miss rates. The geometric mean of the ratios
of L1 miss rates is 2.25 and geometric standard deviation is 1.75 for the five
SPECint2000 programs; they are 3.37 and 1.72 for the five SPECfp2000 programs
(see Chapter 1 to review geometric means and standard deviations).
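
As a reminder of how such summary statistics are formed, the sketch below computes a geometric mean and geometric standard deviation of a list of ratios, using the definition exp(sqrt(mean((ln x - ln GM)^2))). The function names are ours and the input ratios are hypothetical; they are not the per-benchmark ratios behind Figure 5.22.

```python
import math


def geometric_mean(ratios):
    """exp of the arithmetic mean of the natural logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def geometric_std(ratios):
    """exp of the (population) standard deviation of the natural logs."""
    logs = [math.log(r) for r in ratios]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(math.sqrt(variance))


# Hypothetical miss-rate ratios (Pentium 4 / Opteron) for three programs.
example_ratios = [1.0, 2.0, 4.0]
gm = geometric_mean(example_ratios)
gsd = geometric_std(example_ratios)
print(f"geometric mean {gm:.2f}, geometric standard deviation {gsd:.2f}")
```

Because the geometric standard deviation is multiplicative, a mean of 2.0 with a deviation of about 1.76 describes ratios typically spread between roughly 2.0/1.76 and 2.0 x 1.76.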
With twice the L2 block size and L2 cache capacity and more aggressive
prefetching, the Pentium 4 usually has fewer L2 misses per instruction. Surprisingly,
the Opteron L2 cache has fewer on 4 of the 10 programs. This variability is
reflected in the means and high standard deviations: the geometric mean and
standard deviation of the ratios of L2 miss rates are 0.50 and 3.45 for the integer
programs and 1.48 and 2.66 for the floating-point programs. As mentioned earlier,
this nonintuitive result could simply be the consequence of using different
compilers and optimizations. Another possible explanation is that the lower
memory latency and higher memory bandwidth of the Opteron help the effectiveness
of its hardware prefetching, which is known to reduce misses on many of
these floating-point programs. (See Figure 5.10 on page 306.)


Fallacies and Pitfalls 


As the most naturally quantitative of the computer architecture disciplines, mem- 
ory hierarchy would seem to be less vulnerable to fallacies and pitfalls. Yet we 
were limited here not by lack of warnings, but by lack of space! 


Fallacy Predicting cache performance of one program from another. 


Figure 5.23 shows the instruction miss rates and data miss rates for three pro- 
grams from the SPEC2000 benchmark suite as cache size varies. Depending on 
the program, the data misses per thousand instructions for a 4096 KB cache are 9,
2, or 90, and the instruction misses per thousand instructions for a 4 KB cache are
55, 19, or 0.0004. Commercial programs such as databases will have significant 
miss rates even in large second-level caches, which is generally not the case for 
the SPEC programs. Clearly, generalizing cache performance from one program 
to another is unwise. 
Figure 5.23 Instruction and data misses per 1000 instructions as cache size varies
from 4 KB to 4096 KB. Instruction misses for gcc are 30,000 to 40,000 times larger than
lucas, and conversely, data misses for lucas are 2 to 60 times larger than gcc. The
programs gap, gcc, and lucas are from the SPEC2000 benchmark suite.
Pitfall Simulating enough instructions to get accurate performance measures of the
memory hierarchy.


There are really three pitfalls here. One is trying to predict performance of a large 
cache using a small trace. Another is that a program's locality behavior is not 
constant over the run of the entire program. The third is that a program's locality 
behavior may vary depending on the input. 

Figure 5.24 shows the cumulative average instruction misses per thousand
instructions for five inputs to a single SPEC2000 program. For these inputs, the
average miss rate for the first 1.9 billion instructions is very different from the
average miss rate for the rest of the execution.

The first edition of this book included another example of this pitfall. The 
compulsory miss ratios were erroneously high (e.g., 1%) because of tracing too 
few memory accesses. A program with a compulsory cache miss ratio of 1% run- 
ning on a computer accessing memory 10 million times per second (at the time of 
the first edition) would access hundreds of megabytes of memory per minute: 


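$$
\frac{10{,}000{,}000\ \text{accesses}}{\text{second}} \times \frac{0.01\ \text{misses}}{\text{access}} \times \frac{32\ \text{bytes}}{\text{miss}} \times \frac{60\ \text{seconds}}{\text{minute}} = \frac{192{,}000{,}000\ \text{bytes}}{\text{minute}}
$$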


Data on typical page fault rates and process sizes do not support the conclusion 
that memory is touched at this rate. 


Pitfall Overemphasizing memory bandwidth in DRAMs.


Several years ago, a startup named RAMBUS innovated on the DRAM interface. 
Its product, Direct RDRAM, offered up to 1.6 GB/sec of bandwidth from a single 
DRAM. When announced, the peak bandwidth was eight times faster than indi- 
vidual conventional SDRAM chips. Figure 5.25 compares prices of various ver- 
sions of DRAM and RDRAM, in memory modules and in systems. 

PCs do most memory accesses through a two-level cache hierarchy, so it was 
unclear how much benefit is gained from high bandwidth without also improving 
memory latency. According to Pabst [2000], when comparing PCs with 400 MHz 
DRDRAM to PCs with 133 MHz SDRAM, for office applications they had identical
average performance. For games, DRDRAM was 1% to 2% faster. For professional
graphics applications, it was 10% to 15% faster. The tests used an 800
MHz Pentium III (which integrates a 256 KB L2 cache), chip sets that support a
133 MHz system bus, and 128 MB of main memory. 

One measure of the RDRAM cost is die size; it had about a 20% larger die for 
the same capacity compared to SDRAM. DRAM designers use redundant rows 
and columns to improve yield significantly on the memory portion of the DRAM, 
so a much larger interface might have a disproportionate impact on yield. Yields 
are a closely guarded secret, but prices are not. Using the evaluation in Figure 
5.25, in 2000 the price was about a factor of 2-3 higher for RDRAM. In 2006, 
the ratio is not less. 

RDRAM was at its strongest in small memory systems that need high band- 
width, such as a Sony Playstation. 


Figure 5.24 Instruction misses per 1000 references for five inputs to the perl
benchmark from SPEC2000. There is little variation in misses and little difference between
the five inputs for the first 1.9 billion instructions. Running to completion shows how
misses vary over the life of the program and how they depend on the input. The top
graph shows the running average misses for the first 1.9 billion instructions, which
starts at about 2.5 and ends at about 4.7 misses per 1000 references for all five inputs.
The bottom graph shows the running average misses to run to completion, which takes
16-41 billion instructions depending on the input. After the first 1.9 billion instructions,
the misses per 1000 references vary from 2.4 to 7.9 depending on the input. The
simulations were for the Alpha processor using separate L1 caches for instructions and data,
each two-way 64 KB with LRU, and a unified 1 MB direct-mapped L2 cache.


                            Modules           Dell XPS PCs
ECC?                        No ECC   ECC      No ECC                     ECC
Label                       DIMM     RIMM     A        B        B-A      C        D        D-C
Memory or system?           DRAM     DRAM     System   System   DRAM     System   System   DRAM
Memory size (MB)            256      256      128      512      384      128      512      384
SDRAM PC100                 $175     $259     $1519    $2139    $620     $1559    $2269    $710
DRDRAM PC700                $725     $826     $1689    $3009    $1320    $1789    $3409    $1620
Price ratio DRDRAM/SDRAM    4.1      3.2      1.1      1.4      2.1      1.1      1.5      2.3


Figure 5.25 Comparison of price of SDRAM versus DRDRAM in memory modules and in systems in 2000.
DRDRAM memory modules cost about a factor of 4 more without ECC and 3 more with ECC. Looking at the cost of
the extra 384 MB of memory in PCs in going from 128 MB to 512 MB, DRDRAM costs twice as much. Except for
differences in bandwidths of the DRAMs, the systems were identically configured. The Dell XPS PCs were identical except
for memory: 800 MHz Pentium III, 20 GB ATA disk, 48X CD-ROM, 17-inch monitor, and Microsoft Windows 95/98 and
Office 98. The module prices were the lowest found at pricewatch.com in June 2000. By September 2005, PC800
DRDRAM cost $76 for 256 MB, while PC100 to PC150 SDRAM cost $15 to $23, or about a factor of 3.3 to 5.0 less
expensive. (In September 2005 Dell did not offer systems whose only difference was type of DRAMs; hence, we stick
with the comparison from 2000.)


Pitfall Not delivering high memory bandwidth in a cache-based system.


Caches help with average cache memory latency but may not deliver high memory
bandwidth to an application that must go to main memory. The architect must
design a high bandwidth memory behind the cache for such applications. As an
extreme example, the NEC SX7 offers up to 16,384 interleaved SDRAM memory
banks. It is a vector computer that doesn't rely on data caches for memory
performance (see Appendix F). Figure 5.26 shows the top 10 results from the
Stream benchmark as of 2005, which measures bandwidth to copy data
[McCalpin 2005]. Not surprisingly, the NEC SX7 has the top ranking.

Only four computers rely on data caches for memory performance, and their 
memory bandwidth is a factor of 7-25 slower than the NEC SX7. 


Pitfall Implementing a virtual machine monitor on an instruction set architecture that
wasn't designed to be virtualizable.


Many architects in the 1970s and 1980s weren't careful to make sure that all
instructions reading or writing information related to hardware resources
were privileged. This laissez-faire attitude causes problems for VMMs for all
of these architectures, including the 80x86, which we use here as an example.

Figure 5.27 describes the 18 instructions that cause problems for virtualization 
[Robin and Irvine 2000]. The two broad classes are instructions that 


• read control registers in user mode, which reveals that the guest operating
  system is running in a virtual machine (such as POPF mentioned earlier), and


• check protection as required by the segmented architecture but assume that
  the operating system is running at the highest privilege level.
Figure 5.26 Top 10 in memory bandwidth as measured by the untuned copy portion
of the STREAM benchmark [McCalpin 2005]. The number of processors is shown in
parentheses. Two are cache-based clusters (SGI), two are cache-based SMPs (HP), but
most are NEC vector processors of different generations and number of processors.
Systems use between 8 and 512 processors to achieve higher memory bandwidth. System
bandwidth is bandwidth of all processors collectively. Processor bandwidth is simply
system bandwidth divided by the number of processors. The STREAM benchmark is a
simple synthetic benchmark program that measures sustainable memory bandwidth
(in MB/sec) for simple vector kernels. It specifically works with data sets much larger
than the available cache on any given system.


Virtual memory is also challenging. Because the 80x86 TLBs do not support 
process ID tags, as do most RISC architectures, it is more expensive for the 
VMM and guest OSes to share the TLB; each address space change typically 
requires a TLB flush. 
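
The cost of missing process ID tags can be illustrated with a toy model. The sketch below is not a description of any real TLB: the class, its unlimited capacity, and the two-entry page table are all hypothetical, chosen only to show that an untagged TLB must be flushed on each address space switch while a tagged one keeps its entries.

```python
class ToyTLB:
    """Toy TLB with unlimited capacity; `tagged` selects process-ID tagging."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}          # (asid, virtual page) -> physical frame
        self.current_asid = None
        self.flushes = 0
        self.misses = 0

    def switch_to(self, asid):
        """Change address space; an untagged TLB must discard every entry."""
        if asid == self.current_asid:
            return
        if not self.tagged and self.current_asid is not None:
            self.entries.clear()
            self.flushes += 1
        self.current_asid = asid

    def translate(self, vpn, page_table):
        """Return the frame for `vpn`, refilling from the page table on a miss."""
        key = (self.current_asid, vpn)
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = page_table[key]
        return self.entries[key]


def run(tlb, trace, page_table):
    """Replay (asid, vpn) references and report (misses, flushes)."""
    for asid, vpn in trace:
        tlb.switch_to(asid)
        tlb.translate(vpn, page_table)
    return tlb.misses, tlb.flushes


# Two address spaces (say, a VMM and a guest OS) alternating on one page each.
page_table = {(1, 0): 100, (2, 0): 200}
trace = [(1, 0), (2, 0), (1, 0), (2, 0)]
tagged_result = run(ToyTLB(tagged=True), trace, page_table)
untagged_result = run(ToyTLB(tagged=False), trace, page_table)
print("tagged (misses, flushes):  ", tagged_result)
print("untagged (misses, flushes):", untagged_result)
```

In this trace the tagged model misses only on the first touch of each address space, while the untagged model misses on every reference because each switch empties it.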

Virtualizing I/O is also a challenge for the 80x86, in part because it both sup- 
ports memory-mapped I/O and has separate I/O instructions, but more impor- 
tantly, because there is a very large number and variety of types of devices and 
device drivers of PCs for the VMM to handle. Third-party vendors supply their 
own drivers, and they may not properly virtualize. One solution for conventional 
VM implementations is to load real device drivers directly into the VMM. 

To simplify implementations of VMMs on the 80x86, both AMD and Intel 
have proposed extensions to the architecture. Intel's VT-x provides a new execu- 
tion mode for running VMs, an architected definition of the VM state, instruc- 
tions to swap VMs rapidly, and a large set of parameters to select the 
circumstances where a VMM must be invoked. Altogether, VT-x adds 11 new 
instructions for the 80x86. AMD's Pacifica makes similar proposals. 

Problem category                         Problem 80x86 instructions

Access sensitive registers without       Store global descriptor table register (SGDT)
trapping when running in user mode       Store local descriptor table register (SLDT)
                                         Store interrupt descriptor table register (SIDT)
                                         Store machine status word (SMSW)
                                         Push flags (PUSHF, PUSHFD)
                                         Pop flags (POPF, POPFD)

When accessing virtual memory            Load access rights from segment descriptor (LAR)
mechanisms in user mode,                 Load segment limit from segment descriptor (LSL)
instructions fail the 80x86              Verify if segment descriptor is readable (VERR)
protection checks                        Verify if segment descriptor is writable (VERW)
                                         Pop to segment register (POP CS, POP SS, ...)
                                         Push segment register (PUSH CS, PUSH SS, ...)
                                         Far call to different privilege level (CALL)
                                         Far return to different privilege level (RET)
                                         Far jump to different privilege level (JMP)
                                         Software interrupt (INT)
                                         Store segment selector register (STR)
                                         Move to/from segment registers (MOVE)


Figure 5.27 Summary of 18 80x86 instructions that cause problems for virtualization
[Robin and Irvine 2000]. The first five instructions of the top group allow a program
in user mode to read a control register, such as a descriptor table register,
without causing a trap. The pop flags instruction modifies a control register with
sensitive information, but fails silently when in user mode. The protection checking of the
segmented architecture of the 80x86 is the downfall of the bottom group, as each of
these instructions checks the privilege level implicitly as part of instruction execution
when reading a control register. The checking assumes that the OS must be at the
highest privilege level, which is not the case for guest VMs. Only the MOVE to segment
register tries to modify control state, and protection checking foils it as well.


After turning on the mode that enables VT-x support (via the new VMXON
instruction), VT-x offers four privilege levels for the guest OS that are lower in
priority than the original four. VT-x captures all the state of a virtual machine in
the Virtual Machine Control State (VMCS), and then provides atomic instructions
to save and restore a VMCS. In addition to critical state, the VMCS includes
configuration information to determine when to invoke the VMM, and then
specifically what caused the VMM to be invoked. To reduce the number of times the
VMM must be invoked, this mode adds shadow versions of some sensitive registers
and adds masks that check to see whether critical bits of a sensitive register
will be changed before trapping. To reduce the cost of virtualizing virtual memory,
AMD's Pacifica adds an additional level of indirection, called nested page
tables. It makes shadow page tables unnecessary.

It is ironic that AMD and Intel are proposing a new mode. If operating sys- 
tems like Linux or Microsoft Windows start using that mode in their kernel, the 
new mode would cause performance problems for the VMM since it would be 
about 100 times too slow! Nevertheless, the Xen organization plans to use VT-x 
to allow it to support Windows as a guest OS. 
Concluding Remarks


Figure 5.28 compares the memory hierarchy of microprocessors aimed at desktop
and server applications. The L1 caches are similar across applications, with
the primary differences being L2 cache size, die size, processor clock rate, and
instructions issued per clock.

The design decisions at all these levels interact, and the architect must take
the whole system view to make wise decisions. The primary challenge for the
memory hierarchy designer is in choosing parameters that work well together,
not in inventing new techniques. The increasingly fast processors are spending a
larger fraction of time waiting for memory, which has led to new inventions that
have increased the number of choices: prefetching, cache-aware compilers, and
increasing page size. Fortunately, there tend to be a few technological "sweet
spots" in balancing cost, performance, power, and complexity: Missing a target
wastes performance, power, hardware, design time, debug time, or possibly all
five. Architects hit a target by careful, quantitative analysis.


MPU                                AMD Opteron      Intel Pentium 4    IBM Power 5        Sun Niagara
Instruction set architecture       80x86 (64b)      80x86              PowerPC            SPARC v9
Intended application               desktop          desktop            server             server
CMOS process (nm)                  90               90                 130                90
Die size (mm²)                     199              217                389                379
Instructions issued/clock          3                3 RISC ops         8                  1
Processors/chip                    2                1                  2                  8
Clock rate (2006)                  2.8 GHz          3.6 GHz            2.0 GHz            1.2 GHz
Instruction cache per processor    64 KB,           12000 RISC op      64 KB,             16 KB,
                                   2-way set        trace cache        2-way set          1-way set
                                   associative      (~96 KB)           associative        associative
Latency L1 I (clocks)              2                4                  1                  1
Data cache per processor           64 KB,           16 KB,             32 KB,             8 KB,
                                   2-way set        8-way set          4-way set          1-way set
                                   associative      associative        associative        associative
Latency L1 D (clocks)              2                2                  2                  1
TLB entries (I/D/L2 I/L2 D)        40/40/512/512    128/54             1024/1024          64/64
Minimum page size                  4 KB             4 KB               4 KB               8 KB
On-chip L2 cache                   2 x 1 MB,        2 MB,              1.875 MB,          3 MB,
                                   16-way set       8-way set          10-way set         2-way set
                                   associative      associative        associative        associative
L2 banks                           2                1                  3                  4
Latency L2 (clocks)                7                22                 13                 22 I, 23 D
Off-chip L3 cache                  —                —                  36 MB, 12-way set  —
                                                                       associative
                                                                       (tags on chip)
Latency L3 (clocks)                —                —                  87                 —
Block size (L1I/L1D/L2/L3, bytes)  64               64/64/128/—        128/128/128/256    32/16/64/—
Memory bus width (bits)            128              64                 64                 128
Memory bus clock                   200 MHz          200 MHz            400 MHz            400 MHz
Number of memory buses             1                1                  4                  4


Figure 5.28 Memory hierarchy and chip size of desktop and server microprocessors in 2005.


Historical Perspective and References 


In Section K.6 on the companion CD we examine the history of caches, virtual 
memory, and virtual machines. IBM plays a prominent role in the history of all 
three. References for further reading are included. 


Case Studies with Exercises by Norman P. Jouppi 


Case Study 1: Optimizing Cache Performance via Simple 
Hardware 


Concepts illustrated by this case study 


• Small and Simple Caches

• Way Prediction

• Pipelined Caches

• Banked Caches

• Merging Write Buffers

• Critical Word First and Early Restart

• Nonblocking Caches

• Calculating Impact of Cache Performance on Simple In-Order Processors


Imagine (unrealistically) that you are building a simple in-order processor that 
has a CPI of 1 for all nondata memory access instructions. In this case study we 
will consider the performance implications of small and simple caches, way-pre- 
dicting caches, pipelined cache access, banked caches, merging write buffers, and 
critical word first and early restart. Figure 5.29 shows SPEC2000 data miss ratios 
(misses per 1000 instructions) with the harmonic mean of the full execution of all 
benchmarks. 
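
For a simple in-order processor of this kind, the effect of data cache misses on CPI follows directly from the miss count and the miss penalty, and average memory access time follows the usual hit time plus miss rate times miss penalty form. The sketch below is an illustration under assumptions: the function names are ours, cache hits are taken to be covered by the base CPI of 1, the default 20-cycle miss penalty is the value this case study assumes for all misses, and the example miss figures are hypothetical rather than taken from Figure 5.29.

```python
def in_order_cpi(data_misses_per_instruction, miss_penalty_cycles=20,
                 base_cpi=1.0):
    """CPI of an in-order processor that stalls for every data cache miss."""
    return base_cpi + data_misses_per_instruction * miss_penalty_cycles


def average_memory_access_time(hit_time_cycles, miss_rate,
                               miss_penalty_cycles=20):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time_cycles + miss_rate * miss_penalty_cycles


# Hypothetical inputs: 0.01 data misses per instruction, 5% miss rate.
example_cpi = in_order_cpi(0.01)
example_amat = average_memory_access_time(hit_time_cycles=1, miss_rate=0.05)
print(f"CPI with misses: {example_cpi:.2f}")
print(f"AMAT: {example_amat:.2f} cycles")
```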

D-cache misses/inst: 2,521,022,899,870 data refs (0.32899/inst);
1,801,061,937,244 D-cache 64-byte block accesses (0.23289/inst)

Size      Direct       2-way LRU    4-way LRU    8-way LRU    Full LRU
1 KB      0.0863842    0.0697167    0.0634309    0.0563450    0.0533706
2 KB      0.0571524    0.0423833    0.0360463    0.0330364    0.0305213
4 KB      0.0370053    0.0260286    0.0222981    0.0202763    0.0190243
8 KB      0.0247760    0.0155691    0.0129609    0.0107753    0.0083886
16 KB     0.0159470    0.0085658    0.0063527    0.0056438    0.0050068
32 KB     0.0110603    0.0056101    0.0039190    0.0034628    0.0030885
64 KB     0.0066425    0.0036625    0.0009874    0.0002666    0.0000106
128 KB    0.0035823    0.0002341    0.0000109    0.0000058    0.0000058
256 KB    0.0026345    0.0000092    0.0000049    0.0000051    0.0000053
512 KB    0.0014791    0.0000065    0.0000029    0.0000029    0.0000029
1 MB      0.0000090    0.0000058    0.0000028    0.0000028    0.0000028


Figure 5.29 SPEC2000 data miss ratios (misses per 1000 instructions) [Cantin and Hill 2003].


CACTI is a tool for estimating the access and cycle time, dynamic and leakage
power, and area of a cache based on its lithographic feature size and cache
organization. Of course there are many different possible circuit designs for a
given cache organization, and many different technologies for a given lithographic
feature size, but CACTI assumes a "generic" organization and technology.
Thus it may not be accurate for a specific cache design and technology in
absolute terms, but it is fairly accurate at quantifying the relative performance of
different cache organizations at different feature sizes. CACTI is available in an
online form at http://quid.hpl.hp.com:9081/cacti/. Assume all cache misses take
20 cycles.


5.1 [12/12/15/15] <5.2> The following questions investigate the impact of small and
simple caches using CACTI, and assume a 90 nm (0.09 µm) technology.


a. [12] <5.2> Compare the access times of 32 KB caches with 64-byte blocks 
and a single bank. What are the relative access times of two-way and four-way
set associative caches in comparison to a direct-mapped organization? 


b. [12] <5.2> Compare the access times of two-way set-associative caches with 
64-byte blocks and a single bank. What are the relative access times of 32 KB
and 64 KB caches in comparison to a 16 KB cache? 


c. [15] <5.2> Does the access time for a typical level 1 cache organization 
increase with size roughly as the capacity in bytes B, the square root of B, or 
the log of B?


d. [15] <5.2> Find the cache organization with the lowest average memory 
access time given the miss ratio table in Figure 5.29 and a cache access time 
budget of 0.90 ns. What is this organization, and does it have the lowest miss 
rate of all organizations for its capacity? 


5.2 [12/15/15/10] <5.2> You are investigating the possible benefits of a way-predicting
level 1 cache. Assume that the 32 KB two-way set-associative single-banked level 1
data cache is currently the cycle time limiter. As an alternate cache organization
you are considering a way-predicted cache modeled as a 16 KB direct-mapped 
cache with 85% prediction accuracy. Unless stated otherwise, assume a mispre- 
dicted way access that hits in the cache takes one more cycle. 


a. [12] <5.2> What is the average memory access time of the current cache ver- 
sus the way-predicted cache? 


b. [15] <5.2> If all other components could operate with the faster way-predicted
cache cycle time (including the main memory), what would be the impact on 
performance from using the way-predicted cache? 


c. [15] <5.2> Way-predicted caches have usually only been used for instruction 
caches that feed an instruction queue or buffer. Imagine you want to try out 
way prediction on a data cache. Assume you have 85% prediction accuracy, 
and subsequent operations (e.g., data cache access of other instructions, 
dependent operations, etc.) are issued assuming a correct way prediction. 
Thus a way misprediction necessitates a pipe flush and replay trap, which 
requires 15 cycles. Is the change in average memory access time per load 
instruction with data cache way prediction positive or negative, and how 
much is it? 


d. [10] <5.2> As an alternative to way prediction, many large associative level 2 
caches serialize tag and data access, so that only the required data set array 
needs to be activated. This saves power but increases the access time. Use 
CACTI's detailed Web interface for a 0.090 µm process 1 MB four-way 
set-associative cache with 64-byte blocks, 144 bits read out, 1 bank, only 1 read/ 
write port, and 30-bit tags. What are the ratio of the total dynamic read ener- 
gies per access and ratio of the access times for serializing tag and data access 
in comparison to parallel access? 


[10/12/15] <5.2> You have been asked to investigate the relative performance of a 
banked versus pipelined level 1 data cache for a new microprocessor. Assume a 
64 KB two-way set-associative cache with 64 B blocks. The pipelined cache 
would consist of two pipe stages, similar to the Alpha 21264 data cache. A 
banked implementation would consist of two 32 KB two-way set-associative 
banks. Use CACTI and assume a 90 nm (0.09 µm) technology in answering the 
following questions. 


a. [10] <5.2> What is the cycle time of the cache in comparison to its access 
time, and how many pipe stages will the cache take up (to two decimal 
places)? 


b. [12] <5.2> What is the average memory access time if 20% of the cache 
access pipe stages are empty due to data dependencies introduced by pipelin- 
ing the cache and pipelining more finely doubles the miss penalty? 


c. [15] <5.2> What is the average memory access time of the banked design if 
there is a memory access each cycle and a random distribution of bank 
accesses (with no reordering) and bank conflicts cause a one-cycle delay? 
[12/15] <5.2> Inspired by the usage of critical word first and early restart on level 
1 cache misses, consider their use on level 2 cache misses. Assume a 1 MB L2 
cache with 64-byte blocks and a refill path that is 16 bytes wide. Assume the L2 
can be written with 16 bytes every 4 processor cycles, the time to receive the first 
16-byte block from the memory controller is 100 cycles, each additional 16 B 
from main memory requires 16 cycles and data can be bypassed directly into the 
read port of the L2 cache. Ignore any cycles to transfer the miss request to the 
level 2 cache and the requested data to the level 1 cache. 


a. [12] <5.2> How many cycles would it take to service a level 2 cache miss 
with and without critical word first and early restart? 


b. [15] <5.2> Do you think critical word first and early restart would be more 
important for level 1 caches or level 2 caches, and what factors would con- 
tribute to their relative importance? 


[10/12] <5.2> You are designing a write buffer between a write-through level 1 
cache and a write-back level 2 cache. The level 2 cache write data bus is 16 bytes 
wide and can perform a write to an independent cache address every 4 processor 
cycles. 

a. [10] <5.2> How many bytes wide should each write buffer entry be? 

b. [12] <5.2> What speedup could be expected in the steady state by using a 
merging write buffer instead of a nonmerging buffer when zeroing memory 
by the execution of 32-bit stores if all other instructions could be issued in 
parallel with the stores and the blocks are present in the level 2 cache? 


Case Study 2: Optimizing Cache Performance via Advanced 
Techniques 


Concepts illustrated by this case study 


■ Nonblocking Caches 
■ Compiler Optimizations for Caches 
■ Software and Hardware Prefetching 
■ Calculating Impact of Cache Performance on More Complex Processors 


The transpose of a matrix interchanges its rows and columns and is illustrated 
below: 


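$$
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
A_{11} & A_{21} & A_{31} & A_{41} \\
A_{12} & A_{22} & A_{32} & A_{42} \\
A_{13} & A_{23} & A_{33} & A_{43} \\
A_{14} & A_{24} & A_{34} & A_{44}
\end{bmatrix}
$$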
5.6 


5.7 


5.8 


Here is a simple C loop to show the transpose: 


for (i = 0; i < 3; i++) { 
    for (j = 0; j < 3; j++) { 
        output[j][i] = input[i][j]; 
} 
} 
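
The loop above can be wrapped in a small function so that it compiles on its own. The following is only a sketch: the function name `transpose`, the macro `N`, and the choice of N = 4 (to match the 4 x 4 illustration) are assumptions made here, not part of the exercise text, and the loop bound is written as `N` rather than a literal.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* assumed matrix dimension, matching the 4 x 4 illustration */

/* Naive transpose: output[j][i] = input[i][j] for every element. */
void transpose(double input[N][N], double output[N][N]) {
    assert(input != output);  /* this sketch does not transpose in place */
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            output[j][i] = input[i][j];
        }
    }
}
```

Calling `transpose(a, b)` on two distinct N x N arrays of doubles leaves `b[j][i] == a[i][j]` for all valid indices.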


Assume both the input and output matrices are stored in the row major order 
(row major order means row index changes fastest). Assume you are executing a 
256 x 256 double-precision transpose on a processor with a 16 KB fully associa- 
tive (so you don't have to worry about cache conflicts) LRU replacement level 1 
data cache with 64-byte blocks. Assume level 1 cache misses or prefetches 
require 16 cycles, always hit in the level 2 cache, and the level 2 cache can pro- 
cess a request every 2 processor cycles. Assume each iteration of the inner loop 
above requires 4 cycles if the data is present in the level 1 cache. Assume the 
cache has a write-allocate fetch-on-write policy for write misses. Unrealistically 
assume writing back dirty cache blocks requires 0 cycles. 
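
Several quantities that recur in the exercises below follow directly from these parameters. The sketch below derives them; the constant and function names are illustrative assumptions, and a double-precision element is taken to be 8 bytes.

```c
#include <assert.h>

enum {
    CACHE_BYTES   = 16 * 1024, /* 16 KB level 1 data cache */
    BLOCK_BYTES   = 64,        /* cache block size */
    ELEMENT_BYTES = 8,         /* one double-precision value (assumed 8 bytes) */
    MATRIX_DIM    = 256        /* 256 x 256 matrices */
};

/* Number of blocks the cache holds. */
int cache_blocks(void) { return CACHE_BYTES / BLOCK_BYTES; }

/* Matrix elements that share one cache block. */
int elements_per_block(void) {
    assert(BLOCK_BYTES % ELEMENT_BYTES == 0);
    return BLOCK_BYTES / ELEMENT_BYTES;
}

/* Matrix elements the whole cache can hold at once. */
int cache_capacity_elements(void) { return CACHE_BYTES / ELEMENT_BYTES; }

/* Cache blocks spanned by one contiguous line of MATRIX_DIM elements. */
int blocks_per_matrix_line(void) {
    return MATRIX_DIM * ELEMENT_BYTES / BLOCK_BYTES;
}
```

With the stated parameters the cache holds 256 blocks (2048 doubles), each block holds 8 doubles, and one contiguous line of 256 elements spans 32 blocks.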


[10/15/15] <5.2> For the simple implementation given above, this execution 
order would be nonideal for the input matrix. However, applying a loop inter- 
change optimization would create a nonideal order for the output matrix. Because 
loop interchange is not sufficient to improve its performance, it must be blocked 
instead. 


a. [10] <5.2> What block size should be used to completely fill the data cache 
with one input and output block? 


b. [15] <5.2> How do the relative number of misses of the blocked and 
unblocked versions compare if the level 1 cache is direct mapped? 


c. [15] <5.2> Write code to perform a transpose with a block size parameter B 
that uses BxB blocks. 


[12] <5.2> Assume you are redesigning a hardware prefetcher for the unblocked 
matrix transposition code above. The simplest type of hardware prefetcher only 
prefetches sequential cache blocks after a miss. More complicated "nonunit 
stride" hardware prefetchers can analyze a miss reference stream, and detect and 
prefetch nonunit strides. In contrast, software prefetching can determine nonunit 
strides as easily as it can determine unit strides. Assume prefetches write directly 
into the cache and no pollution (overwriting data that needs to be used before the 
data that is prefetched). In the steady state of the inner loop, what is the perfor- 
mance (in cycles per iteration) when using an ideal nonunit stride prefetcher? 


[15/15] <5.2> Assume you are redesigning a hardware prefetcher for the 
unblocked matrix transposition code as in Exercise 5.7. However, in this case we 
evaluate a simple two-stream sequential prefetcher. If there are level 2 access 
slots available, this prefetcher will fetch up to 4 sequential blocks after a miss and 
place them in a stream buffer. Stream buffers that have empty slots obtain access 
to the level 2 cache on a round-robin basis. On a level 1 miss, the stream buffer 
that has least recently supplied data on a miss is flushed and reused for the new 
miss stream. 

a. [15] <5.2> In the steady state of the inner loop, what is the performance (in 
cycles per iteration) when using a simple two-stream sequential prefetcher 
assuming performance is limited by prefetching? 


b. [15] <5.2> What percentage of prefetches are useful given the level 2 cache 
parameters? 

5.9 [12/15] <5.2> With software prefetching it is important to be careful to have the 
prefetches occur in time for use, but also minimize the number of outstanding 
prefetches, in order to live within the capabilities of the microarchitecture and 
minimize cache pollution. This is complicated by the fact that different proces- 
sors have different capabilities and limitations. 


a. [12] <5.2> Modify the unblocked code above to perform prefetching in soft- 
ware. 


b. [15] <5.2> What is the expected performance of unblocked code with soft- 
ware prefetching? 


Case Study 3: Main Memory Technology and Optimizations 


Concepts illustrated by this case study 


■ Memory System Design: Latency, Bandwidth, Cost, and Power 
■ Calculating Impact of Memory System Performance 


Using Figure 5.14, consider the design of a variety of memory systems. Assume a 
chip multiprocessor with eight 3 GHz cores and directly attached memory control- 
lers (i.e., integrated northbridge) as in the Opteron. The chip multiprocessor (CMP) 
contains a single shared level 2 cache, with misses from that level going to main 
memory (i.e., no level 3 cache). A sample DDR2 SDRAM timing diagram appears 
in Figure 5.30. tRCD is the time required to activate a row in a bank, while the CAS 
latency (CL) is the number of cycles required to read out a column in a row. 
Figure 5.30 DDR2 SDRAM timing diagram. 
5.12 


Assume the RAM is on a standard DDR2 DIMM with ECC, having 72 data lines. 
Also assume burst lengths of 8 which read out 8 bits per data line, or a total of 32 
bytes from the DIMM. Assume the DRAMs have a 1 KB page size, 8 banks, tRCD = 
CL * Clock_frequency, and Clock_frequency = Transfers_per_second/2. The on- 
chip latency on a cache miss through levels 1 and 2 and back not including the 
DRAM access is 20 ns. Assume a DDR2-667 1 GB DIMM with CL = 5 is available 
for $130 and a DDR2-533 1 GB DIMM with CL = 4 is available for $100. (See 
http://download.micron.com/pdf/technotes/ddr2/TN4702.pdf for more details on 
DDR2 memory organization and timing.) 
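
The relationship between transfer rate and clock can be sketched as follows. The function names are illustrative assumptions, transfer rates are expressed in millions of transfers per second, and a latency given as a cycle count is converted to time by dividing by the clock frequency, which is one interpretation of the timing parameters above rather than something the text spells out.

```c
#include <assert.h>

/* Clock frequency in MHz from the transfer rate in MT/s
   (two transfers per clock cycle for DDR). */
double clock_mhz(double mega_transfers_per_sec) {
    assert(mega_transfers_per_sec > 0.0);
    return mega_transfers_per_sec / 2.0;
}

/* Duration in nanoseconds of `cycles` clock cycles at that transfer rate. */
double cycles_to_ns(double cycles, double mega_transfers_per_sec) {
    return cycles * 1000.0 / clock_mhz(mega_transfers_per_sec);
}
```

For example, `clock_mhz(667.0)` gives 333.5 MHz, and `cycles_to_ns(5.0, 667.0)` and `cycles_to_ns(4.0, 533.0)` both come out close to 15 ns.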





[10/10/10/12/12] <5.3> Assume the system is your desktop PC and only one core 
on the CMP is active. Assume there is only one memory channel. 


a. [10] <5.3> How many DRAMs are on the DIMM if 512 Mbit DRAMs are 
used, and how many data I/Os must each DRAM have if only one DRAM 
connects to each DIMM data pin? 


b. [10] <5.3> What burst length is required to support 32-byte versus 64-byte 
level 2 cache blocks? 


c. [10] <5.3> What is the peak bandwidth ratio between the DIMMs for reads 
from an active page? 


d. [12] <5.3> How much time is required from the presentation of the activate 
command until the last requested bit of data from the DRAM transitions from 
valid to invalid for the DDR2-533 1 GB CL = 4 DIMM? 


e. [12] <5.3> What is the relative latency when using the DDR2-533 DIMM of 
a read requiring a bank activate versus one to an already open page, including 
the time required to process the miss inside the processor? 


[15] <5.3> Assume just one DIMM is used in a system, and the rest of the system 
costs $800. Consider the performance of the system using the DDR2-667 and 
DDR2-533 DIMMs on a workload with 3.33 level 2 misses per 1K instructions, 
and assume all DRAM reads require an activate. Assume all 8 cores are active 
with the same workload. What is the cost divided by the performance of the 
whole system when using the different DIMMs assuming only one level 2 miss is 
outstanding at a time and an in-order core with a CPI of 1.5 not including level 2 
cache miss memory access time? 


[12] <5.3> You are provisioning a server based on the system above. All 8 cores 
on the CMP will be busy with an overall CPI of 2.0 (assuming level 2 cache miss 
refills are not delayed). What bandwidth is required to support all 8 cores running 
a workload with 6.67 level 2 misses per 1K instructions, and optimistically 
assuming misses from all cores are uniformly distributed in time? 


[12] <5.3> A large amount (more than a third) of DRAM power can be due to page 
activation (see http://download.micron.com/pdf/technotes/ddr2/TN4704.pdf and 
http://www.micron.com/systemcalc). Assume you are building a system with 1 GB 
of memory using either 4-bank 512 Mbit x 4 DDR2 DRAMs or 8-bank 1 Gbit x 8 
DRAMs, both with the same speed grade. Both use a page size of 1 KB. Assume 
DRAMs that are not active are in precharged standby and dissipate negligible 
power. Assume the time to transition from standby to active is not significant. 
Which type of DRAM would be expected to result in lower power? Explain why. 


Case Study 4: Virtual Machines 


Concepts illustrated by this case study 


■ Capabilities Provided by Virtual Machines 
■ Impact of Virtualization on Performance 
■ Features and Impact of Architectural Extensions to Support Virtualization 


Intel and AMD have both created extensions to the x86 architecture to address 
the shortcomings of the x86 for virtualization. Intel's solution is called VT-x (Vir- 
tualization Technology x86) (see IEEE [2005] for more information on VT-x), 
while AMD's is called Secure Virtual Machine (SVM). Intel has a corresponding 
technology for the Itanium architecture called VT-i. Figure 5.31 lists the early 
performance of various system calls under native execution, pure virtualization, 
and paravirtualization for LMbench using Xen on an Itanium system with times 
measured in microseconds (courtesy of Matthew Chapman of the University of 
New South Wales). 


[10/10/10/10/10] <5.4> Virtual machines have the potential for adding many ben- 
eficial capabilities to computer systems, for example, resulting in improved total 
cost of ownership (TCO) or availability. Could VMs be used to provide the fol- 
lowing capabilities? If so, how could they facilitate this? 


a. [10] <5.4> Make it easy to consolidate a large number of applications run- 
ning on many old uniprocessor servers onto a single higher-performance 
CMP-based server? 


b. [10] <5.4> Limit damage caused by computer viruses, worms, or spyware? 


c. [10] <5.4> Higher performance in memory-intensive applications with large 
memory footprints? 


d. [10] <5.4> Dynamically provision extra capacity for peak application loads? 


e. [10] <5.4> Run a legacy application on old operating systems on modern 
machines? 


[10/10/12/12] <5.4> Virtual machines can lose performance from a number of 
events, such as the execution of privileged instructions, TLB misses, traps, and 
I/O. These events are usually handled in system code. Thus one way of estimat- 
ing the slowdown when running under a VM is the percentage of application 
execution time in system versus user mode. For example, an application spend- 
ing 10% of its execution in system mode might slow down by 60% when run- 
ning on a VM. 


a. [10] <5.4> What types of programs would be expected to have larger slow- 
downs when running under VMs? 
Benchmark             Native     Pure     Para 
Null call               0.04     0.96     0.50 
Null I/O                0.27     6.32     2.91 
Stat                    1.10    10.69     4.14 
Open/close              1.99    20.43     7.71 
Install sighandler      0.33     7.34     2.89 
Handle signal           1.69    19.26     2.36 
Fork                   56.00   513.00   164.00 
Exec                  316.00  2084.00   578.00 
Fork + exec sh       1451.00  7790.00  2360.00 





Figure 5.31 Early performance of various system calls under native execution, pure 
virtualization, and paravirtualization. 


b. [10] <5.4> If slowdowns were linear as a function of system time, given the 
slowdown above, how much slower would a program spending 30% of its 
execution in system time be expected to run? 


c. [12] <5.4> What is the mean slowdown of the functions in Figure 5.31 under 
pure virtualization and paravirtualization? 


d. [12] <5.4> Which functions in Figure 5.31 have the smallest slowdowns? 
What do you think the cause of this could be? 


[12] <5.4> Popek and Goldberg's definition of a virtual machine said that it 
would be indistinguishable from a real machine except for its performance. In 
this exercise we'll use that definition to find out if we have access to native execu- 
tion on a processor or are running on a virtual machine. The Intel VT-x technol- 
ogy effectively provides a second set of privilege levels for the use of the virtual 
machine. What would happen to relative performance of a virtual machine if it 
was running on a native machine or on another virtual machine given two sets of 
privilege levels as in VT-x? 


[15/20] <5.4> With the adoption of virtualization support on the x86 architecture, 
virtual machines are actively evolving and becoming mainstream. Compare and 
contrast the Intel VT-x and AMD Secure Virtual Machine (SVM) virtualization 
technologies. Information on AMD's SVM can be found in 
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf. 








a. [15] <5.4> How do VT-x and SVM handle privilege-sensitive instructions? 


b. [20] <5.4> What do VT-x and SVM do to provide higher performance for 
memory-intensive applications with large memory footprints? 
Case Study 5: Putting It All Together: Highly Parallel Memory 
Systems 


Concept illustrated by this case study 


■ Understanding the Impact of Memory System Design Tradeoffs on Machine 
Performance 


The program in Figure 5.32 can be used to evaluate the behavior of a memory 
system. The key is having accurate timing and then having the program stride 
through memory to invoke different levels of the hierarchy. Figure 5.32 is the 
code in C. The first part is a procedure that uses a standard utility to get an accu- 
rate measure of the user CPU time; this procedure may need to change to work 
on some systems. The second part is a nested loop to read and write memory at 
different strides and cache sizes. To get accurate cache timing this code is 
repeated many times. The third part times the nested loop overhead only so that it 
can be subtracted from overall measured times to see how long the accesses were. 
The results are output in .csv file format to facilitate importing into spreadsheets. 
You may need to change ARRAY_MAX depending on the question you are answering 
and the size of memory on the system you are measuring. Running the program 
in single-user mode or at least without other active applications will give 
more consistent results. The code in Figure 5.32 was derived from a program 
written by Andrea Dusseau of U.C. Berkeley and was based on a detailed 
description found in Saavedra-Barrera [1992]. It has been modified to fix a num- 
ber of issues with more modern machines and to run under Microsoft Visual 
C++. 

The program shown in Figure 5.32 assumes that program addresses track 
physical addresses, which is true on the few machines that use virtually 
addressed caches, such as the Alpha 21264. In general, virtual addresses tend to 
follow physical addresses shortly after rebooting, so you may need to reboot the 
machine in order to get smooth lines in your results. To answer the exercises, 
assume that the sizes of all components of the memory hierarchy are powers of 2. 
Assume that the size of the page is much larger than the size of a block in a sec- 
ond-level cache (if there is one), and the size of a second-level cache block is 
greater than or equal to the size of a block in a first-level cache. An example of 
the output of the program is plotted in Figure 5.33, with the key listing the size of 
the array that is exercised. 


[10/12/12/12/12] <5.6> Using the sample program results in Figure 5.33: 


a. [10] <5.6> How many levels of cache are there? 

b. [12] <5.6> What are the overall size and block size of the first-level cache? 
c. [12] <5.6> What is the miss penalty of the first-level cache? 

d. [12] <5.6> What is the associativity of the first-level cache? 

e. [12] <5.6> What effects can you see when the size of the data used in the 
array is equal to the size of the second-level cache? 
#include "stdafx.h" 
#include <stdio.h> 
#include <time.h> 
#define ARRAY_MIN (1024)      /* 1/4 smallest cache */ 
#define ARRAY_MAX (4096*4096) /* 1/4 largest cache */ 
int x[ARRAY_MAX];             /* array going to stride through */ 

double get_seconds() { /* routine to read time in seconds */ 
    __time64_t ltime; 
    _time64(&ltime); 
    return (double) ltime; 
} 

int label(int i) { /* generate text labels */ 
    if (i<1e3) printf("%1dB,",i); 
    else if (i<1e6) printf("%1dK,",i/1024); 
    else if (i<1e9) printf("%1dM,",i/1048576); 
    else printf("%1dG,",i/1073741824); 
    return 0; 
} 

int _tmain(int argc, _TCHAR* argv[]) { 
    int register nextstep, i, index, stride; 
    int csize; 
    double steps, tsteps; 
    double loadtime, lastsec, sec0, sec1, sec; /* timing variables */ 

    /* Initialize output */ 
    printf(" ,"); 
    for (stride=1; stride <= ARRAY_MAX/2; stride=stride*2) 
        label(stride*sizeof(int)); 
    printf("\n"); 

    /* Main loop for each configuration */ 
    for (csize=ARRAY_MIN; csize <= ARRAY_MAX; csize=csize*2) { 
        label(csize*sizeof(int)); /* print cache size this loop */ 
        for (stride=1; stride <= csize/2; stride=stride*2) { 

            /* Lay out path of memory references in array */ 
            for (index=0; index < csize; index=index+stride) 
                x[index] = index + stride; /* pointer to next */ 
            x[index-stride] = 0; /* loop back to beginning */ 

            /* Wait for timer to roll over */ 
            lastsec = get_seconds(); 
            do sec0 = get_seconds(); while (sec0 == lastsec); 

            /* Walk through path in array for twenty seconds */ 
            /* This gives 5% accuracy with second resolution */ 
            steps = 0.0;          /* number of steps taken */ 
            nextstep = 0;         /* start at beginning of path */ 
            sec0 = get_seconds(); /* start timer */ 
            do { /* repeat until collect 20 seconds */ 
                for (i=stride;i!=0;i=i-1) { /* keep samples same */ 
                    nextstep = 0; 
                    do nextstep = x[nextstep]; /* dependency */ 
                    while (nextstep != 0); 
                } 
                steps = steps + 1.0;  /* count loop iterations */ 
                sec1 = get_seconds(); /* end timer */ 
            } while ((sec1 - sec0) < 20.0); /* collect 20 seconds */ 
            sec = sec1 - sec0; 

            /* Repeat empty loop to subtract loop overhead */ 
            tsteps = 0.0;         /* used to match no. while iterations */ 
            sec0 = get_seconds(); /* start timer */ 
            do { /* repeat until same no. iterations as above */ 
                for (i=stride;i!=0;i=i-1) { /* keep samples same */ 
                    index = 0; 
                    do index = index + stride; 
                    while (index < csize); 
                } 
                tsteps = tsteps + 1.0; 
                sec1 = get_seconds(); /* - overhead */ 
            } while (tsteps<steps); /* until = no. iterations */ 
            sec = sec - (sec1 - sec0); 
            loadtime = (sec*1e9)/(steps*csize); 
            /* write out results in .csv format for Excel */ 
            printf("%4.1f,", (loadtime<0.1) ? 0.1 : loadtime); 
        }; /* end of inner for loop */ 
        printf("\n"); 
    }; /* end of outer for loop */ 
    return 0; 
} 


Figure 5.32 C program for evaluating memory systems. 
[Plot omitted. x-axis: Stride, from 4B to 256M.] 


Figure 5.33 Sample results from program in Figure 5.32. 


[15/20/25] <5.6> Modify the code in Figure 5.32 to measure the following sys- 
tem characteristics. Plot the experimental results with elapsed time on the y-axis 
and the memory stride on the x-axis. Use logarithmic scales for both axes, and 
draw a line for each cache size. 


a. [15] <5.6> Is the L1 cache write-through or write-back? 
b. [20] <5.6> Is the memory system blocking or nonblocking? 


c. [25] <5.6> For a nonblocking memory system, how many outstanding mem- 
ory references can be supported? 


[25/25] <5.6> In multiprocessor memory systems, lower levels of the memory 
hierarchy may not be able to be saturated by a single processor, but should be 
able to be saturated by multiple processors working together. Modify the code in 
Figure 5.32, and run multiple copies at the same time. Can you determine: 


a. [25] <5.6> How much bandwidth does a shared level 2 or level 3 cache (if 
present) provide? 


b. [25] <5.6> How much bandwidth does the main memory system provide? 


[30] <5.6> Since instruction-level parallelism can also be effectively exploited 
on in-order superscalar processors and VLIWs with speculation, one important 
reason for building an out-of-order (OOO) superscalar processor is the ability 
to tolerate unpredictable memory latency caused by cache misses. Hence, you 
can think about hardware supporting OOO issue as being part of the memory 
system! Look at the floorplan of the Alpha 21264 in Figure 5.34 to find the 
relative area of the integer and floating-point issue queues and mappers versus 
the caches. The queues schedule instructions for issue, and the mappers rename 
register specifiers. Hence these are necessary additions to support OOO issue. 
The 21264 only has level 1 data and instruction caches on chip, and they are 
both 64 KB two-way set associative. Use an OOO superscalar simulator such 
as Simplescalar (www.cs.wisc.edu/~mscalar/simplescalar.html) on memory- 
intensive benchmarks to find out how much performance is lost if the area of 
the issue queues and mappers is used for additional level 1 data cache area in an 
in-order superscalar processor, instead of OOO issue in a model of the 21264. 
Make sure the other aspects of the machine are as similar as possible to make 
the comparison fair. Ignore any increase in access or cycle time from larger 
caches and effects of the larger data cache on the floorplan of the chip. (Note 
this comparison will not be totally fair, as the code will not have been sched- 
uled for the in-order processor by the compiler.) 
Figure 5.34 Floorplan of the Alpha 21264 [Kessler 1999]. 
I think Silicon Valley was misnamed. If you look back at the dollars 
shipped in products in the last decade, there has been more revenue 
from magnetic disks than from silicon. They ought to rename the place 
Iron Oxide Valley. 

Al Hoagland 


a pioneer of magnetic disks 
(1982) 


Combining bandwidth and storage ...enables swift and reliable access 
to the ever expanding troves of content on the proliferating disks and 
... repositories of the Internet.... the capacity of storage arrays of all 
kinds is rocketing ahead of the advance of computer performance. 
George Gilder 


"The End Is Drawing Nigh," 
Forbes ASAP (April 4, 2000) 
6.1 Introduction 


The popularity of Internet services like search engines and auctions has enhanced 
the importance of I/O for computers, since no one would want a desktop com- 
puter that couldn't access the Internet. This rise in importance of I/O is reflected 
by the names of our times. The 1960s to 1980s were called the Computing Revo- 
lution; the period since 1990 has been called the Information Age, with concerns 
focused on advances in information technology versus raw computational power. 
Internet services depend upon massive storage, which is the focus of this chapter, 
and networking, which is the focus of Appendix E. 

This shift in focus from computation to communication and storage of infor- 
mation emphasizes reliability and scalability as well as cost-performance. 
Although it is frustrating when a program crashes, people become hysterical if 
they lose their data. Hence, storage systems are typically held to a higher stan- 
dard of dependability than the rest of the computer. Dependability is the bedrock 
of storage, yet it also has its own rich performance theory—queuing theory—that 
balances throughput versus response time. The software that determines which 
processor features get used is the compiler, but the operating system usurps that 
role for storage. 

Thus, storage has a different, multifaceted culture from processors, yet it is 
still found within the architecture tent. We start our exploration with advances in 
magnetic disks, as they are the dominant storage device today in desktop and 
server computers. We assume readers are already familiar with the basics of stor- 
age devices, some of which were covered in Chapter 1. 


Advanced Topics in Disk Storage 


The disk industry historically has concentrated on improving the capacity of 
disks. Improvement in capacity is customarily expressed as improvement in 
areal density, measured in bits per square inch: 


$$
\text{Areal density} = \frac{\text{Tracks}}{\text{Inch}} \text{ on a disk surface} \times \frac{\text{Bits}}{\text{Inch}} \text{ on a track}
$$


Through about 1988, the rate of improvement of areal density was 29% per 
year, thus doubling density every three years. Between then and about 1996, 
the rate improved to 60% per year, quadrupling density every three years and 
matching the traditional rate of DRAMs. From 1997 to about 2003, the rate 
increased to 100%, or doubling every year. After the innovations that allowed 
the renaissances had largely played out, the rate has dropped recently to about 
30% per year. In 2006, the highest density in commercial products is 130 bil- 
lion bits per square inch. Cost per gigabyte has dropped at least as fast as areal 
density has increased, with smaller diameter drives playing the larger role in 
this improvement. Costs per gigabyte improved by a factor of 100,000 between 
1983 and 2006. 
Magnetic disks have been challenged many times for supremacy of secondary 
storage. Figure 6.1 shows one reason: the fabled access time gap between disks 
and DRAM. DRAM latency is about 100,000 times less than disk, and that per- 
formance advantage costs 30-150 times more per gigabyte for DRAM. 

The bandwidth gap is more complex. For example, a fast disk in 2006 trans- 
fers about 115 MB/sec from the disk media with 37 GB of storage and costs 
about $150 (as we will see later in Figure 6.3). A 2 GB DRAM module costing 
about $300 in 2006 could transfer at 3200 MB/sec (see Section 5.3 in Chapter 5), 
giving the DRAM module about 28 times higher bandwidth than the disk. How- 
ever, the bandwidth per GB is 500 times higher for DRAM, and the bandwidth 
per dollar is 14 times higher. 
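
The three ratios in the preceding paragraph can be reproduced from the quoted figures. The sketch below is illustrative only: the `struct device` type and the function names are assumptions introduced here, and the inputs are the 2006 numbers given in the text.

```c
#include <assert.h>

/* A storage device described by the three figures quoted in the text. */
struct device {
    double mb_per_sec;   /* transfer rate in MB/sec */
    double capacity_gb;  /* capacity in GB */
    double price_usd;    /* price in dollars */
};

/* Raw bandwidth of a relative to b. */
double bandwidth_ratio(struct device a, struct device b) {
    assert(b.mb_per_sec > 0.0);
    return a.mb_per_sec / b.mb_per_sec;
}

/* Bandwidth per gigabyte of a relative to b. */
double bandwidth_per_gb_ratio(struct device a, struct device b) {
    assert(a.capacity_gb > 0.0 && b.capacity_gb > 0.0 && b.mb_per_sec > 0.0);
    return (a.mb_per_sec / a.capacity_gb) / (b.mb_per_sec / b.capacity_gb);
}

/* Bandwidth per dollar of a relative to b. */
double bandwidth_per_dollar_ratio(struct device a, struct device b) {
    assert(a.price_usd > 0.0 && b.price_usd > 0.0 && b.mb_per_sec > 0.0);
    return (a.mb_per_sec / a.price_usd) / (b.mb_per_sec / b.price_usd);
}
```

With a disk of {115, 37, 150} and a DRAM module of {3200, 2, 300}, the three functions return roughly 28, 515, and 14, in line with the rounded values in the text.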

Many have tried to invent a technology cheaper than DRAM but faster than 
disk to fill that gap, but thus far, all have failed. Challengers have never had a 
product to market at the right time. By the time a new product would ship, 
DRAMs and disks have made advances as predicted earlier, costs have dropped 
accordingly, and the challenging product is immediately obsolete. 

The closest challenger is flash memory. This semiconductor memory is non- 
volatile like disks, and it has about the same bandwidth as disks, but latency is 
100-1000 times faster than disk. In 2006, the price per gigabyte of flash was 
about the same as DRAM. Flash is popular in cameras and portable music play- 
ers because it comes in much smaller capacities and it is more power efficient 
than disks, despite the cost per gigabyte being 50 times higher than disks. Unlike 
Figure 6.1 Cost versus access time for DRAM and magnetic disk in 1980, 1985, 1990, 1995, 2000, and 2005. The 
two-order-of-magnitude gap in cost and five-order-of-magnitude gap in access times between semiconductor 
memory and rotating magnetic disks has inspired a host of competing technologies to try to fill them. So far, such 
attempts have been made obsolete before production by improvements in magnetic disks, DRAMs, or both. Note 
that between 1990 and 2005 the cost per gigabyte DRAM chips made less improvement, while disk cost made dra- 
matic improvement. 
disks and DRAM, flash memory bits wear out—typically limited to 1 million 
writes—and so they are not popular in desktop and server computers. 

While disks will remain viable for the foreseeable future, the conventional 
sector-track-cylinder model did not. The assumptions of the model are that 
nearby blocks are on the same track, blocks in the same cylinder take less time to 
access since there is no seek time, and some tracks are closer than others. 

First, disks started offering higher-level intelligent interfaces, like ATA and 
SCSI, when they included a microprocessor inside a disk. To speed up sequential 
transfers, these higher-level interfaces organize disks more like tapes than like 
random access devices. The logical blocks are ordered in serpentine fashion 
across a single surface, trying to capture all the sectors that are recorded at the 
same bit density. (Disks vary the recording density since it is hard for the elec- 
tronics to keep up with the blocks spinning much faster on the outer tracks, and 
lowering linear density simplifies the task.) Hence, sequential blocks may be on 
different tracks. We will see later in Figure 6.22 on page 401 an illustration of the 
fallacy of assuming the conventional sector-track model when working with 
modern disks. 

Second, shortly after the microprocessors appeared inside disks, the disks 
included buffers to hold the data until the computer was ready to accept it, and 
later caches to avoid read accesses. They were joined by a command queue that 
allowed the disk to decide in what order to perform the commands to maximize 
performance while maintaining correct behavior. Figure 6.2 shows how a queue 
depth of 50 can double the number of I/Os per second of random I/Os due to bet- 
ter scheduling of accesses. Although it's unlikely that a system would really have 
256 commands in a queue, it would triple the number of I/Os per second. Given 
buffers, caches, and out-of-order accesses, an accurate performance model of a 
real disk is much more complicated than sector-track-cylinder. 
Figure 6.2 Throughput versus command queue depth using random 512-byte
reads. The disk performs 170 reads per second starting at no command queue, and 
doubles performance at 50 and triples at 256 [Anderson 2003]. 
Finally, the number of platters shrank from 12 in the past to 4 or even 1 today,
so the cylinder has less importance than before since the percentage of data in a 
cylinder is much less. 


Disk Power 


Power is an increasing concern for disks as well as for processors. A typical ATA 
disk in 2006 might use 9 watts when idle, 11 watts when reading or writing, and 
13 watts when seeking. Because it is more efficient to spin smaller mass, smaller- 
diameter disks can save power. One formula that indicates the importance of rota- 
tion speed and the size of the platters for the power consumed by the disk motor 
is the following [Gurumurthi 2005]: 


$$\text{Power} \approx \text{Diameter}^{4.6} \times \text{RPM}^{2.8} \times \text{Number of platters}$$


Thus, smaller platters, slower rotation, and fewer platters all help reduce disk 
motor power, and most of the power is in the motor. 
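
To get a feel for how strongly each term in the motor power formula matters, the short Python sketch below compares two drives by ratio only. The function name and the sample ratios are illustrative assumptions, not values taken from the text or from any data sheet.

```python
def relative_motor_power(diameter_ratio, rpm_ratio, platter_ratio=1.0):
    """Motor power of drive B relative to drive A.

    Uses the proportionality Power ~ Diameter^4.6 x RPM^2.8 x platters,
    so each argument is the ratio (drive B / drive A) of that quantity.
    """
    return (diameter_ratio ** 4.6) * (rpm_ratio ** 2.8) * platter_ratio


# Hypothetical comparison (assumed numbers): drive B has platters 70% as wide,
# spins at 15,000 RPM instead of 7200 RPM, and has 1 platter instead of 4.
ratio = relative_motor_power(0.7, 15000 / 7200, 1 / 4)
print(f"drive B motor power relative to drive A: {ratio:.2f}x")
```

Because the exponents are large, a modest change in diameter or rotation speed changes the estimate far more than adding or removing a platter does.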

Figure 6.3 shows the specifications of two 3.5-inch disks in 2006. The Serial 
ATA (SATA) disks shoot for high capacity and the best cost per gigabyte, and so 
the 500 GB drives cost less than $1 per gigabyte. They use the widest platters that 
fit the form factor and use four or five of them, but they spin at 7200 RPM and 
seek relatively slowly to lower power. The corresponding Serial Attach SCSI 
(SAS) drive aims at performance, and so it spins at 15,000 RPM and seeks much 
faster. To reduce power, the platter is much narrower than the form factor and it 
has only a single platter. This combination reduces capacity of the SAS drive to 
37 GB. 

The cost per gigabyte is about a factor of five better for the SATA drives, and 
conversely, the cost per I/O per second or MB transferred per second is about a 
factor of five better for the SAS drives. Despite using smaller platters and many 
fewer of them, the SAS disks use twice the power of the SATA drives, due to the 
much faster RPM and seeks. 
Figure 6.3 Serial ATA (SATA) versus Serial Attach SCSI (SAS) drives in 3.5-inch form factor in 2006. The I/Os per
second are calculated using the average seek plus the time for one-half rotation plus the time to transfer one sector
of 512 bytes.
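
The I/Os-per-second calculation described in the caption can be sketched in Python as follows. The drive parameters in the usage lines are placeholders chosen for illustration (they are not the values from Figure 6.3), and the sketch assumes a 512-byte sector.

```python
def ios_per_second(avg_seek_ms, rpm, transfer_mb_per_s, sector_bytes=512):
    """Random I/Os per second: 1 / (average seek + half rotation + transfer)."""
    half_rotation_ms = 0.5 * 60_000 / rpm                  # half a revolution, in ms
    transfer_ms = sector_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return 1000 / (avg_seek_ms + half_rotation_ms + transfer_ms)


# Placeholder parameters (assumptions, not data-sheet values).
slow = ios_per_second(avg_seek_ms=8.5, rpm=7200, transfer_mb_per_s=60)
fast = ios_per_second(avg_seek_ms=3.5, rpm=15000, transfer_mb_per_s=100)
print(f"7200 RPM drive:   {slow:.0f} I/Os per second")
print(f"15,000 RPM drive: {fast:.0f} I/Os per second")
```

For a single small sector the transfer term is tiny, so seek time and rotational latency dominate the result.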
Advanced Topics in Disk Arrays


An innovation that improves both dependability and performance of storage sys- 
tems is disk arrays. One argument for arrays is that potential throughput can be 
increased by having many disk drives and, hence, many disk arms, rather than 
fewer large drives. Simply spreading data over multiple disks, called striping, 
automatically forces accesses to several disks if the data files are large. (Although 
arrays improve throughput, latency is not necessarily improved.) As we saw in 
Chapter 1, the drawback is that with more devices, dependability decreases: N
devices generally have 1/N the reliability of a single device.

Although a disk array would have more faults than a smaller number of larger 
disks when each disk has the same reliability, dependability is improved by add- 
ing redundant disks to the array to tolerate faults. That is, if a single disk fails, the 
lost information is reconstructed from redundant information. The only danger is 
in having another disk fail during the mean time to repair (MTTR). Since the 
mean time to failure (MTTF) of disks is tens of years, and the MTTR is mea- 
sured in hours, redundancy can make the measured reliability of many disks 
much higher than that of a single disk. 
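
A rough Python sketch of this argument follows. It assumes independent disk failures with constant failure rates and an MTTR much smaller than the MTTF; the second function uses the common approximation that data is lost only if a second disk fails while the first is being repaired. The function names and sample numbers are illustrative assumptions.

```python
def array_mttf_no_redundancy(disk_mttf_hours, n_disks):
    """N devices have about 1/N the MTTF of one device."""
    return disk_mttf_hours / n_disks


def array_mttf_single_redundancy(disk_mttf_hours, n_disks, mttr_hours):
    """Approximate MTTF when the array tolerates one failed disk.

    Time to the first failure is MTTF/N; data is lost only if one of the
    remaining N-1 disks also fails within the repair window (MTTR).
    """
    time_to_first_failure = disk_mttf_hours / n_disks
    p_second_failure = mttr_hours / (disk_mttf_hours / (n_disks - 1))
    return time_to_first_failure / p_second_failure


# Assumed numbers: 1,000,000-hour disks, 9 disks in the array, 24-hour repair.
plain = array_mttf_no_redundancy(1_000_000, 9)
redundant = array_mttf_single_redundancy(1_000_000, 9, 24)
print(f"without redundancy: {plain:,.0f} hours")
print(f"with one redundant disk: {redundant:,.0f} hours")
```

Under these assumptions the redundant array's estimated MTTF exceeds that of a single disk by a wide margin, which is the point the paragraph above makes.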

Such redundant disk arrays have become known by the acronym RAID, standing
originally for redundant array of inexpensive disks, although some prefer
the word independent for I in the acronym. The ability to recover from failures
plus the higher throughput, either measured as megabytes per second or as I/Os 
per second, makes RAID attractive. When combined with the advantages of 
smaller size and lower power of small-diameter drives, RAIDs now dominate 
large-scale storage systems. 

Figure 6.4 summarizes the five standard RAID levels, showing how eight 
disks of user data must be supplemented by redundant or check disks at each 
RAID level, and lists the pros and cons of each level. The standard RAID levels 
are well documented, so we will just do a quick review here and discuss 
advanced levels in more depth. 


• RAID 0: It has no redundancy and is sometimes nicknamed JBOD, for "just a
bunch of disks," although the data may be striped across the disks in the array.
This level is generally included to act as a measuring stick for the other RAID
levels in terms of cost, performance, and dependability.


• RAID 1: Also called mirroring or shadowing, this level keeps two copies of
every piece of data. It is the simplest and oldest disk redundancy scheme, but
it also has the highest cost. Some array controllers will optimize read
performance by allowing the mirrored disks to act independently for reads, but
this optimization means it may take longer for the mirrored writes to complete.


• RAID 2: This organization was inspired by applying memory-style error
correcting codes to disks. It was included because there was such a disk array
product at the time of the original RAID paper, but there have been none since
then, as other RAID organizations are more attractive.
| RAID level | Failures tolerated, check space overhead for 8 data disks | Pros | Cons | Company products |
|---|---|---|---|---|
| 0 Nonredundant striped | 0 failures, 0 check disks | No space overhead | No protection | Widely used |
| 1 Mirrored | 1 failure, 8 check disks | No parity calculation; fast recovery; small writes faster than higher RAIDs; fast reads | Highest check storage overhead | EMC, HP (Tandem), IBM |
| 2 Memory-style ECC | 1 failure, 4 check disks | Doesn't rely on failed disk to self-diagnose | ~ Log 2 check storage overhead | Not used |
| 3 Bit-interleaved parity | 1 failure, 1 check disk | Low check overhead; high bandwidth for large reads or writes | No support for small, random reads or writes | Storage Concepts |
| 4 Block-interleaved parity | 1 failure, 1 check disk | Low check overhead; more bandwidth for small reads | Parity disk is small write bottleneck | Network Appliance |
| 5 Block-interleaved distributed parity | 1 failure, 1 check disk | Low check overhead; more bandwidth for small reads and writes | Small writes → 4 disk accesses | Widely used |
| 6 Row-diagonal parity, EVEN-ODD | 2 failures, 2 check disks | Protects against 2 disk failures | Small writes → 6 disk accesses; 2X check overhead | Network Appliance |





Figure 6.4 RAID levels, their fault tolerance, and their overhead in redundant disks. The paper that introduced 
the term RAID [Patterson, Gibson, and Katz 1987] used a numerical classification that has become popular. In fact, the 
nonredundant disk array is often called RAID 0, indicating the data are striped across several disks but without 
redundancy. Note that mirroring (RAID 1) in this instance can survive up to eight disk failures provided only one disk 
of each mirrored pair fails; worst case is both disks in a mirrored pair. In 2006, there may be no commercial
implementations of RAID 2; the rest are found in a wide range of products. RAID 0 + 1, 1 + 0, 01, 10, and 6 are
discussed in the text.


• RAID 3: Since the higher-level disk interfaces understand the health of a
disk, it's easy to figure out which disk failed. Designers realized that if one
extra disk contains the parity of the information in the data disks, a single
disk allows recovery from a disk failure. The data is organized in stripes, with
N data blocks and one parity block. When a failure occurs, you just "subtract"
the good data from the good blocks, and what remains is the missing data.
(This works whether the failed disk is a data disk or the parity disk.) RAID 3
assumes the data is spread across all disks on reads and writes, which is
attractive when reading or writing large amounts of data.


• RAID 4: Many applications are dominated by small accesses. Since sectors
have their own error checking, you can safely increase the number of reads
per second by allowing each disk to perform independent reads. It would
seem that writes would still be slow, if you have to read every disk to
calculate parity. To increase the number of writes per second, an alternative
approach involves only two disks. First, the array reads the old data that is
about to be overwritten, and then calculates what bits would change before
it writes the new data. It then reads the old value of the parity on the check
disk, updates parity according to the list of changes, and then writes the
new value of parity to the check disk. Hence, these so-called "small writes"
are still slower than small reads (they involve four disk accesses), but
they are faster than if you had to read all disks on every write. RAID 4 has
the same low check disk overhead as RAID 3, and it can still do large reads
and writes as fast as RAID 3 in addition to small reads and writes, but
control is more complex.


• RAID 5: A performance flaw for small writes in RAID 4 is that they all must
read and write the same check disk, so it is a performance bottleneck.
RAID 5 simply distributes the parity information across all disks in the array,
thereby removing the bottleneck. The parity block in each stripe is rotated so
that parity is spread evenly across all disks. The disk array controller must
now calculate which disk holds the parity whenever it wants to write a given
block, but that can be a simple calculation. RAID 5 has the same low check
disk overhead as RAID 3 and 4, and it can do the large reads and writes of
RAID 3 and the small reads of RAID 4, but it has higher small write
bandwidth than RAID 4. Nevertheless, RAID 5 requires the most sophisticated
controller of the classic RAID levels.
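
The parity mechanics described for RAID 3, 4, and 5 above can be sketched in a few lines of Python. Blocks are modeled as byte strings and parity as bytewise XOR. The function names are hypothetical, and the rotation rule in raid5_parity_disk is just one possible layout assumed for illustration; real controllers may rotate parity differently.

```python
from functools import reduce


def parity(blocks):
    """Even parity of equal-length blocks: the bytewise XOR of all of them."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks):
    """Rebuild the single missing block of a stripe ("subtracting" is XOR).

    surviving_blocks holds every remaining block of the stripe, data and parity.
    """
    return parity(surviving_blocks)


def small_write_parity(old_data, new_data, old_parity):
    """New parity for a small write, using only the data disk and check disk.

    Four disk accesses in total: read old data, read old parity,
    write new data, write new parity.
    """
    return parity([old_data, new_data, old_parity])


def raid5_parity_disk(stripe, n_disks):
    """Assumed rotation: parity moves one disk to the left on each stripe."""
    return (n_disks - 1 - stripe) % n_disks


data = [b"\x0f\x10", b"\xf0\x01", b"\x33\x44", b"\x55\x66"]
check = parity(data)

# Lose data block 2 and rebuild it from the three good blocks plus parity.
assert reconstruct([data[0], data[1], data[3], check]) == data[2]

# Overwrite block 1 and update parity without touching blocks 0, 2, and 3.
new_block = b"\xaa\xbb"
new_check = small_write_parity(data[1], new_block, check)
assert new_check == parity([data[0], new_block, data[2], data[3]])

print([raid5_parity_disk(s, 5) for s in range(6)])
```

The same reconstruct call works whether the lost block was data or parity, which mirrors the parenthetical remark in the RAID 3 description.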


Having completed our quick review of the classic RAID levels, we can now 
look at two levels that have become popular since RAID was introduced. 


RAID 10 versus 01 (or 1 + 0 versus RAID 0 + 1)


One topic not always described in the RAID literature involves how mirroring in 
RAID 1 interacts with striping. Suppose you had, say, four disks worth of data to 
store and eight physical disks to use. Would you create four pairs of disks—each 
organized as RAID 1—and then stripe data across the four RAID 1 pairs? Alter- 
natively, would you create two sets of four disks—each organized as RAID 0— 
and then mirror writes to both RAID 0 sets? The RAID terminology has evolved 
to call the former RAID 1 + 0 or RAID 10 ("striped mirrors") and the latter 
RAID 0 + 1 or RAID 01 ("mirrored stripes"). 
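
One way to see that the two organizations differ is to enumerate failure combinations for the eight-disk example. The Python sketch below is an illustration under a simplifying assumption that is not stated in the text: in RAID 01, a striped set is treated as unusable as soon as any one of its disks fails. The disk numbering and function names are hypothetical.

```python
from itertools import combinations

N_DISKS = 8


def raid10_survives(failed):
    """Striped mirrors: pairs (0,1), (2,3), (4,5), (6,7).

    Data is lost only if both disks of some mirrored pair fail.
    """
    failed = set(failed)
    return all(not {2 * i, 2 * i + 1} <= failed for i in range(N_DISKS // 2))


def raid01_survives(failed):
    """Mirrored stripes: RAID 0 sets {0..3} and {4..7}.

    Assumption: a set with any failed disk is unusable, so data survives
    only if at least one set has no failed disks.
    """
    failed = set(failed)
    set_a = set(range(N_DISKS // 2))
    set_b = set(range(N_DISKS // 2, N_DISKS))
    return not (failed & set_a) or not (failed & set_b)


def survivable_cases(survives, n_failures):
    """Return (number of survivable combinations, total combinations)."""
    cases = list(combinations(range(N_DISKS), n_failures))
    return sum(survives(c) for c in cases), len(cases)


for name, rule in (("RAID 10", raid10_survives), ("RAID 01", raid01_survives)):
    ok, total = survivable_cases(rule, 2)
    print(f"{name}: survives {ok} of {total} two-disk failure combinations")
```

Under this model both organizations survive any single failure, but striped mirrors survive more of the two-disk combinations than mirrored stripes do.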


RAID 6: Beyond a Single Disk Failure 


The parity-based schemes of RAID 1 to 5 protect against a single
self-identifying failure. However, if an operator accidentally replaces the wrong
disk during a failure, then the disk array will experience two failures, and data
will be lost. Another concern is that since disk bandwidth is growing more slowly
than disk capacity, the MTTR of a disk in a RAID system is increasing, which in
turn increases the chances of a second failure. For example, a 500 GB SATA disk
could take about 3 hours to read sequentially assuming no interference. Given
that the damaged RAID is likely to continue to serve data, reconstruction could
be stretched considerably, thereby increasing MTTR. Besides increasing
reconstruction time, another concern is that reading much more data during
reconstruction means increasing the chance of an uncorrectable media failure,
which would result in data loss. Other arguments for concern about simultaneous
multiple failures are the increasing number of disks in arrays and the use of ATA
disks, which are slower and larger than SCSI disks.
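
The 3-hour estimate is simply capacity divided by sustained bandwidth. The Python sketch below reproduces it under an assumed sustained transfer rate of 46 MB/sec (the text does not state the rate), and adds a hypothetical rebuild_share parameter to show how serving foreground requests stretches reconstruction.

```python
def sequential_read_hours(capacity_gb, bandwidth_mb_per_s):
    """Hours to read a whole disk sequentially with no interference."""
    seconds = capacity_gb * 1000 / bandwidth_mb_per_s
    return seconds / 3600


def rebuild_hours(capacity_gb, bandwidth_mb_per_s, rebuild_share):
    """Hours to rebuild when only a fraction of bandwidth goes to the rebuild.

    rebuild_share is an assumed knob in (0, 1]; 1.0 means an idle array.
    """
    return sequential_read_hours(capacity_gb, bandwidth_mb_per_s) / rebuild_share


print(f"idle array:      {sequential_read_hours(500, 46):.1f} hours")
print(f"25% for rebuild: {rebuild_hours(500, 46, 0.25):.1f} hours")
```

A longer rebuild directly lengthens the window during which a second failure would cause data loss.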

Hence, over the years, there has been growing interest in protecting against 
more than one failure. Network Appliance, for example, started by building 
RAID 4 file servers. As double failures were becoming a danger to customers, 
they created a more robust scheme to protect data, called row-diagonal parity or 
RAID-DP [Corbett 2004]. Like the standard RAID schemes, row-diagonal parity 
uses redundant space based on a parity calculation on a per-stripe basis. Since it 
is protecting against a double failure, it adds two check blocks per stripe of data. 
Let's assume there are p + 1 disks total, and so p - 1 disks have data. Figure 6.5
shows the case when p is 5. 

The row parity disk is just like in RAID 4; it contains the even parity across 
the other four data blocks in its stripe. Each block of the diagonal parity disk con- 
tains the even parity of the blocks in the same diagonal. Note that each diagonal 
does not cover one disk; for example, diagonal 0 does not cover disk 1. Hence,
we need just p - 1 diagonals to protect the p disks, so the diagonal parity disk
only has diagonals 0 to 3 in Figure 6.5.

Let's see how row-diagonal parity works by assuming that data disks 1 and 3 
fail in Figure 6.5. We can't perform the standard RAID recovery using the first 
row using row parity, since it is missing two data blocks from disks 1 and 3. 
However, we can perform recovery on diagonal 0, since it is only missing the 
data block associated with disk 3. Thus, row-diagonal parity starts by recovering 
one of the four blocks on the failed disk in this example using diagonal parity. 
Since each diagonal misses one disk, and all diagonals miss a different disk, two 
diagonals are only missing one block. They are diagonals 0 and 2 in this example, 


Figure 6.5 Row diagonal parity for p = 5, which protects four data disks from double
failures [Corbett 2004]. This figure shows the diagonal groups for which parity is calcu- 
lated and stored in the diagonal parity disk. Although this shows all the check data in 
separate disks for row parity and diagonal parity as in RAID 4, there is a rotated version 
of row-diagonal parity that is analogous to RAID 5. Parameter p must be prime and 
greater than 2. However, you can make p larger than the number of data disks by 
assuming the missing disks have all zeros, and the scheme still works. This trick makes it 
easy to add disks to an existing system. NetApp picks p to be 257, which allows the sys- 
tem to grow to up to 256 data disks. 
so we next restore the block from diagonal 2 from failed disk 1. Once the data for
those blocks is recovered, then the standard RAID recovery scheme can be used 
to recover two more blocks in the standard RAID 4 stripes 0 and 2, which in turn 
allows us to recover more diagonals. This process continues until two failed disks 
are completely restored. 
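
The recovery procedure can be sketched in Python for p = 5. This is a minimal illustration, not NetApp's implementation: each block is a single integer, the block in row r on disk d is assumed to lie on diagonal (r + d) mod p (which matches the statement that diagonal 0 does not cover disk 1), and recovery simply alternates between diagonals and rows, repairing any group that is missing exactly one block.

```python
import random
from functools import reduce
from operator import xor

P = 5          # prime parameter p: disks 0..P-2 hold data, disk P-1 holds row parity
ROWS = P - 1   # blocks per disk in one stripe


def diagonal_members(k):
    """(row, disk) positions on diagonal k, over the data and row parity disks."""
    return [(r, (k - r) % P) for r in range(ROWS)]


def encode(data):
    """Append row parity to each row; return (grid, diagonal parity blocks)."""
    grid = [row + [reduce(xor, row)] for row in data]
    diag = [reduce(xor, (grid[r][d] for r, d in diagonal_members(k)))
            for k in range(P - 1)]          # diagonal P-1 is not stored
    return grid, diag


def recover(grid, diag, failed_disks):
    """Restore all blocks of up to two failed disks (among disks 0..P-1) in place."""
    missing = {(r, d) for r in range(ROWS) for d in failed_disks}
    while missing:
        progress = False
        for k in range(P - 1):              # diagonals with exactly one lost block
            members = diagonal_members(k)
            lost = [m for m in members if m in missing]
            if len(lost) == 1:
                r, d = lost[0]
                grid[r][d] = reduce(
                    xor, (grid[i][j] for i, j in members if (i, j) != (r, d)), diag[k])
                missing.remove((r, d))
                progress = True
        for r in range(ROWS):               # rows with exactly one lost block
            lost = [d for d in range(P) if (r, d) in missing]
            if len(lost) == 1:
                d = lost[0]
                grid[r][d] = reduce(xor, (grid[r][j] for j in range(P) if j != d))
                missing.remove((r, d))
                progress = True
        if not progress:
            raise ValueError("unrecoverable failure pattern")
    return grid


rng = random.Random(0)
data = [[rng.randrange(256) for _ in range(P - 1)] for _ in range(ROWS)]
grid, diag = encode(data)

damaged = [row[:] for row in grid]
for r in range(ROWS):
    damaged[r][1] = damaged[r][3] = None    # data disks 1 and 3 fail

assert recover(damaged, diag, [1, 3]) == grid
print("disks 1 and 3 restored")
```

With disks 1 and 3 failed, the first pass repairs one block through diagonal 0 and one through diagonal 2, after which row parity and the remaining diagonals take turns until nothing is missing.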

The EVEN-ODD scheme developed earlier by researchers at IBM is similar 
to row diagonal parity, but it has a bit more computation during operation and 
recovery [Blaum 1995]. Papers that are more recent show how to expand EVEN- 
ODD to protect against three failures [Blaum 1996; Blaum 2001]. 


Definition and Examples of Real Faults and Failures 


Although people may be willing to live with a computer that occasionally crashes 
and forces all programs to be restarted, they insist that their information is never 
lost. The prime directive for storage is then to remember information, no matter 
what happens. 

Chapter 1 covered the basics of dependability, and this section expands that 
information to give the standard definitions and examples of failures. 

The first step is to clarify confusion over terms. The terms fault, error, and
failure are often used interchangeably, but they have different meanings in the
dependability literature. For example, is a programming mistake a fault, error, or 
failure? Does it matter whether we are talking about when it was designed, or 
when the program is run? If the running program doesn't exercise the mistake, is 
it still a fault/error/failure? Try another one. Suppose an alpha particle hits a 
DRAM memory cell. Is it a fault/error/failure if it doesn't change the value? Is it 
a fault/error/failure if the memory doesn't access the changed bit? Did a fault/ 
error/failure still occur if the memory had error correction and delivered the cor- 
rected value to the CPU? You get the drift of the difficulties. Clearly, we need pre- 
cise definitions to discuss such events intelligently. 

To avoid such imprecision, this subsection is based on the terminology used 
by Laprie [1985] and Gray and Siewiorek [1991], endorsed by IFIP working 
group 10.4 and the IEEE Computer Society Technical Committee on Fault Toler- 
ance. We talk about a system as a single module, but the terminology applies to 
submodules recursively. Let's start with a definition of dependability: 


Computer system dependability is the quality of delivered service such that reli- 
ance can justifiably be placed on this service. The service delivered by a system 
is its observed actual behavior as perceived by other system(s) interacting with 
this system's users. Each module also has an ideal specified behavior, where a 
service specification is an agreed description of the expected behavior. A system
failure occurs when the actual behavior deviates from the specified behavior.
The failure occurred because of an error, a defect in that module. The cause of
an error is a fault.


When a fault occurs, it creates a latent error, which becomes effective when it is 
activated; when the error actually affects the delivered service, a failure occurs. 
The time between the occurrence of an error and the resulting failure is the error
latency. Thus, an error is the manifestation in the system of a fault, and a failure is
the manifestation on the service of an error. [p. 3]


Let's go back to our motivating examples above. A programming mistake is a 
fault. The consequence is an error (or latent error) in the software. Upon activa- 
tion, the error becomes effective. When this effective error produces erroneous 
data that affect the delivered service, a failure occurs. 

An alpha particle hitting a DRAM can be considered a fault. If it changes the 
memory, it creates an error. The error will remain latent until the affected mem- 
ory word is read. If the effective word error affects the delivered service, a failure 
occurs. If ECC corrected the error, a failure would not occur. 
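
The DRAM example can be restated as a small decision procedure. The Python sketch below is only an illustration of the terminology; the enum and parameter names are hypothetical and the four yes/no questions are taken from the alpha particle example.

```python
from enum import Enum


class Outcome(Enum):
    FAULT_ONLY = "fault, but no error"
    LATENT_ERROR = "latent error"
    EFFECTIVE_ERROR = "effective error, but no failure"
    FAILURE = "failure"


def classify(value_changed, word_read, ecc_corrected, service_affected):
    """Classify what an alpha particle strike on a DRAM cell leads to."""
    if not value_changed:
        return Outcome.FAULT_ONLY        # the strike is a fault; no error was created
    if not word_read:
        return Outcome.LATENT_ERROR      # error exists but has not been activated
    if ecc_corrected or not service_affected:
        return Outcome.EFFECTIVE_ERROR   # activated, but delivered service is unaffected
    return Outcome.FAILURE               # the error reached the delivered service


print(classify(True, True, ecc_corrected=True, service_affected=False).value)
print(classify(True, True, ecc_corrected=False, service_affected=True).value)
```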

A mistake by a human operator is a fault. The resulting altered data is an 
error. It is latent until activated, and so on as before. 

To clarify, the relation between faults, errors, and failures is as follows: 


• A fault creates one or more latent errors.


• The properties of errors are (1) a latent error becomes effective once
activated; (2) an error may cycle between its latent and effective states; (3) an
effective error often propagates from one component to another, thereby
creating new errors. Thus, either an effective error is a formerly latent error in
that component, or it has propagated from another error in that component or
from elsewhere.


• A component failure occurs when the error affects the delivered service.


• These properties are recursive and apply to any component in the system.


Gray and Siewiorek classify faults into four categories according to their cause:


1. Hardware faults: Devices that fail, perhaps due to an alpha particle
hitting a memory cell


2. Design faults: Faults in software (usually) and hardware design (occasionally)


3. Operation faults: Mistakes by operations and maintenance personnel


4. Environmental faults: Fire, flood, earthquake, power failure, and sabotage


Faults are also classified by their duration into transient, intermittent, and perma- 
nent [Nelson 1990]. Transient faults exist for a limited time and are not recurring. 
Intermittent faults cause a system to oscillate between faulty and fault-free oper- 
ation. Permanent faults do not correct themselves with the passing of time. 

Now that we have defined the difference between faults, errors, and failures, 
we are ready to see some real-world examples. Publications of real error rates are 
rare for two reasons. First, academics rarely have access to significant hardware 
resources to measure. Second, industrial researchers are rarely allowed to publish 
failure information for fear that it would be used against their companies in the 
marketplace. A few exceptions follow. 
Berkeley's Tertiary Disk


The Tertiary Disk project at the University of California created an art image 
server for the Fine Arts Museums of San Francisco. This database consists of 
high-quality images of over 70,000 artworks. The database was stored on a clus- 
ter, which consisted of 20 PCs connected by a switched Ethernet and containing 
368 disks. It occupied seven 7-foot-high racks. 

Figure 6.6 shows the failure rates of the various components of Tertiary Disk. 
In advance of building the system, the designers assumed that SCSI data disks 
would be the least reliable part of the system, as they are both mechanical and 
plentiful. Next would be the IDE disks, since there were fewer of them, then the 
power supplies, followed by integrated circuits. They assumed that passive 
devices like cables would scarcely ever fail. 

Figure 6.6 shatters some of those assumptions. Since the designers followed 
the manufacturer's advice of making sure the disk enclosures had reduced vibra- 
tion and good cooling, the data disks were very reliable. In contrast, the PC chas- 
sis containing the IDE/ATA disks did not afford the same environmental controls. 
(The IDE/ATA disks did not store data, but helped the application and operating 
system to boot the PCs.) Figure 6.6 shows that the SCSI backplane, cables, and 
Ethernet cables were no more reliable than the data disks themselves! 

As Tertiary Disk was a large system with many redundant components, it 
could survive this wide range of failures. Components were connected and mir- 
rored images were placed so that no single failure could make any image unavail- 
able. This strategy, which initially appeared to be overkill, proved to be vital. 

This experience also demonstrated the difference between transient faults and 
hard faults. Virtually all the failures in Figure 6.6 appeared first as transient 
faults. It was up to the operator to decide whether a component's behavior was so
poor that it needed to be replaced or whether it could continue. In fact, the word
"failure" was not used; instead, the group borrowed terms normally used for
dealing with problem employees, with the operator deciding whether a problem
component should or should not be "fired."


Tandem 


The next example comes from industry. Gray [1990] collected data on faults for
Tandem Computers, which was one of the pioneering companies in fault-tolerant
computing and whose systems were used primarily for databases. Figure 6.7 graphs the faults that
caused system failures between 1985 and 1989 in absolute faults per system and 
in percentage of faults encountered. The data show a clear improvement in the 
reliability of hardware and maintenance. Disks in 1985 needed yearly service by 
Tandem, but they were replaced by disks that needed no scheduled maintenance. 
Shrinking numbers of chips and connectors per system plus software's ability to 
tolerate hardware faults reduced hardware's contribution to only 7% of failures 
by 1989. Moreover, when hardware was at fault, software embedded in the hard- 
ware device (firmware) was often the culprit. The data indicate that software in 
| Component | Total in system | Total failed | Percentage failed |
|---|---|---|---|
| SCSI controller | 44 | 1 | 2.3% |
| SCSI cable | 39 | 1 | 2.6% |
| SCSI disk | 368 | 7 | 1.9% |
| IDE/ATA disk | 24 | 6 | 25.0% |
| Disk enclosure (backplane) | 46 | 13 | 28.3% |
| Disk enclosure (power supply) | 92 | 3 | 3.3% |
| Ethernet controller | 20 | 1 | 5.0% |
| Ethernet switch | 2 | 1 | 50.0% |
| Ethernet cable | 42 | 1 | 2.3% |
| CPU/motherboard | 20 | 0 | 0% |





Figure 6.6 Failures of components in Tertiary Disk over 18 months of operation. 
For each type of component, the table shows the total number in the system, the 
number that failed, and the percentage failure rate. Disk enclosures have two entries 
in the table because they had two types of problems: backplane integrity failures and 
power supply failures. Since each enclosure had two power supplies, a power supply 
failure did not affect availability. This cluster of 20 PCs, contained in seven
7-foot-high, 19-inch-wide racks, hosts 368 8.4 GB, 7200 RPM, 3.5-inch IBM disks.
The PCs are P6-200 MHz with 96 MB of DRAM each. They ran FreeBSD 3.0, and the
hosts are connected via switched 100 Mbit/sec Ethernet. All SCSI disks are
connected to two PCs via double-ended SCSI chains to support RAID 1. The primary
application is called the Zoom Project, which in 1998 was the world's largest art
image database, with 72,000 images. See Talagala et al. [2000].
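
The percentage column of Figure 6.6 is just failed units divided by total units. The short Python sketch below recomputes it from the component counts in the table; the dictionary layout and function name are illustrative choices.

```python
# (total in system, total failed) for each component type in Figure 6.6
COMPONENTS = {
    "SCSI controller": (44, 1),
    "SCSI cable": (39, 1),
    "SCSI disk": (368, 7),
    "IDE/ATA disk": (24, 6),
    "Disk enclosure (backplane)": (46, 13),
    "Disk enclosure (power supply)": (92, 3),
    "Ethernet controller": (20, 1),
    "Ethernet switch": (2, 1),
    "Ethernet cable": (42, 1),
    "CPU/motherboard": (20, 0),
}


def percent_failed(total, failed):
    """Percentage of units of one component type that failed."""
    return round(100 * failed / total, 1)


for name, (total, failed) in COMPONENTS.items():
    print(f"{name:32s}{total:5d}{failed:5d}{percent_failed(total, failed):7.1f}%")
```

Recomputing this way gives about 1.9% for the 368 SCSI data disks, far below the IDE/ATA disks and backplanes. The Ethernet cable row rounds to about 2.4%, slightly different from the 2.3% printed in the table.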


1989 was the major source of reported outages (62%), followed by system opera- 
tions (15%). 

The problem with any such statistics is that these data only refer to what is 
reported; for example, environmental failures due to power outages were not 
reported to Tandem because they were seen as a local problem. Data on operation 
faults is very difficult to collect because it relies on the operators to report per- 
sonal mistakes, which may affect the opinion of their managers, which in turn 
can affect job security and pay raises. Gray believes both environmental faults 
and operator faults are underreported. His study concluded that achieving higher 
availability requires improvement in software quality and software fault toler- 
ance, simpler operations, and tolerance of operational faults. 


Other Studies of the Role of Operators in Dependability 


While Tertiary Disk and Tandem are storage-oriented dependability studies, we 
need to look outside storage to find better measurements on the role of humans 
in failures. Murphy and Gent [1995] tried to improve the accuracy of data on 
operator faults by having the system automatically prompt the operator on each 
Figure 6.7 Faults in Tandem between 1985 and 1989. Gray [1990] collected these data for the fault-tolerant
Tandem computers based on reports of component failures by customers.


boot for the reason for that reboot. They classified consecutive crashes to the 
same fault as operator fault and included operator actions that directly resulted 
in crashes, such as giving parameters bad values, bad configurations, and bad 
application installation. Although they believe operator error is still under- 
reported, they did get more accurate information than did Gray, who relied on a 
form that the operator filled out and then sent up the management chain. The 
hardware/operating system went from causing 70% of the failures in VAX
systems in 1985 to 28% in 1993, and failures due to operators rose from 15% to
52% in that same period. Murphy and Gent expected managing systems to be 
the primary dependability challenge in the future. 

The final set of data comes from the government. The Federal Communica- 
tions Commission (FCC) requires that all telephone companies submit explana- 
tions when they experience an outage that affects at least 30,000 people or lasts 
30 minutes. These detailed disruption reports do not suffer from the self- 
reporting problem of earlier figures, as investigators determine the cause of the 
outage rather than operators of the equipment. Kuhn [1997] studied the causes of 
outages between 1992 and 1994, and Enriquez [2001] did a follow-up study for 
the first half of 2001. Although there was a significant improvement in failures 
due to overloading of the network over the years, failures due to humans 
increased, from about one-third to two-thirds of the customer-outage minutes. 

These four examples and others suggest that the primary cause of failures in 
large systems today is faults by human operators. Hardware faults have declined 
due to a decreasing number of chips in systems and fewer connectors. Hardware 
dependability has improved through fault tolerance techniques such as memory 
ECC and RAID. At least some operating systems are considering reliability 
implications before adding new features, so in 2006 the failures largely occurred 
elsewhere. 

Although failures may be initiated due to faults by operators, it is a poor
reflection on the state of the art of systems that the processes of maintenance
and upgrading are so error prone. Most storage vendors claim today that customers
spend much more on managing storage over its lifetime than they do on purchas- 
ing the storage. Thus, the challenge for dependable storage systems of the future 
is either to tolerate faults by operators or to avoid faults by simplifying the tasks 
of system administration. Note that RAID 6 allows the storage system to survive 
even if the operator mistakenly replaces a good disk. 

We have now covered the bedrock issue of dependability, giving definitions, 
case studies, and techniques to improve it. The next step in the storage tour is per- 
formance. 


I/O Performance, Reliability Measures, and Benchmarks


I/O performance has measures that have no counterparts in design. One of these
is diversity: which I/O devices can connect to the computer system? Another is 
capacity: how many I/O devices can connect to a computer system? 

In addition to these unique measures, the traditional measures of perfor- 
mance, namely, response time and throughput, also apply to I/O. (I/O throughput 
is sometimes called I/O bandwidth, and response time is sometimes called 
latency.) The next two figures offer insight into how response time and through- 
put trade off against each other. Figure 6.8 shows the simple producer-server 
model. The producer creates tasks to be performed and places them in a buffer; 
the server takes tasks from the first in, first out buffer and performs them. 








Figure 6.8 The traditional producer-server model of response time and throughput. 
Response time begins when a task is placed in the buffer and ends when it is
completed by the server. Throughput is the number of tasks completed by the
server in unit time.


Response time is defined as the time a task takes from the moment it is placed 
in the buffer until the server finishes the task. Throughput is simply the average 
number of tasks completed by the server over a time period. To get the highest 
possible throughput, the server should never be idle, and thus the buffer should 
never be empty. Response time, on the other hand, counts time spent in the buffer, 
so an empty buffer shrinks it. 
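
The tension between the two measures can be illustrated with a minimal sketch of the producer-server model. The function name `simulate_fifo` and the arrival and service times are hypothetical, chosen only for illustration; the sketch assumes a single server and a first in, first out buffer.

```python
def simulate_fifo(arrivals, service_times):
    """Replay tasks through one FIFO server.

    arrivals: times at which the producer places each task in the buffer
              (sorted ascending); service_times: server time needed per task.
    Returns (mean response time, throughput).
    """
    server_free_at = 0.0
    response_times = []
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, server_free_at)  # task waits in the buffer if the server is busy
        server_free_at = start + service
        response_times.append(server_free_at - arrive)
    elapsed = server_free_at - arrivals[0]
    return sum(response_times) / len(response_times), len(response_times) / elapsed

# Light load: the buffer is empty whenever a task arrives, so response time is minimal.
light = simulate_fifo([0.0, 2.0, 4.0, 6.0], [1.0] * 4)
# Heavy load: all tasks arrive at once, so the server is never idle.
heavy = simulate_fifo([0.0, 0.0, 0.0, 0.0], [1.0] * 4)
print(light, heavy)
```

In this toy run the heavy load keeps the server busy and raises throughput, while the mean response time grows because tasks spend time in the buffer.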

Another measure of I/O performance is the interference of I/O with processor 
execution. Transferring data may interfere with the execution of another process. 
There is also overhead due to handling I/O interrupts. Our concern here is how 
much longer a process will take because of I/O for another process. 


Throughput versus Response Time 


Figure 6.9 shows throughput versus response time (or latency) for a typical I/O 
system. The knee of the curve is the area where a little more throughput results in 
much longer response time or, conversely, a little shorter response time results in 
much lower throughput. 

How does the architect balance these conflicting demands? If the computer is 
interacting with human beings, Figure 6.10 suggests an answer. An interaction, or 
transaction, with a computer is divided into three parts: 


1. Entry time—The time for the user to enter the command. 


2. System response time—The time between when the user enters the command 
and the complete response is displayed. 


3. Think time—The time from the reception of the response until the user begins 
to enter the next command. 


The sum of these three parts is called the transaction time. Several studies report 
that user productivity is inversely proportional to transaction time. The results in 
Figure 6.10 show that cutting system response time by 0.7 seconds saves 4.9 sec- 
onds (34%) from the conventional transaction and 2.0 seconds (70%) from the 
graphics transaction. This implausible result is explained by human nature: Peo- 
ple need less time to think when given a faster response. Although this study is 20 
years old, response times are often still much slower than 1 second, even if 
Figure 6.9 Throughput versus response time. Latency is normally reported as 
response time. Note that the minimum response time achieves only 11% of the 
throughput, while the response time for 100% throughput takes seven times the mini- 
mum response time. Note that the independent variable in this curve is implicit: to 
trace the curve, you typically vary load (concurrency). Chen et al. [1990] collected these 
data for an array of magnetic disks. 


[Figure 6.10 bar chart: time (sec, 0 to 15) per transaction, split into entry time, system response time, and think time, for four workloads: conventional interactive workload at 1.0 sec and at 0.3 sec system response time (-34% total, -70% think), and high-function graphics workload at 1.0 sec and at 0.3 sec system response time (-70% total, -81% think).] 


Figure 6.10 A user transaction with an interactive computer divided into entry 
time, system response time, and user think time for a conventional system and 
graphics system. The entry times are the same, independent of system response time. 
The entry time was 4 seconds for the conventional system and 0.25 seconds for the 
graphics system. Reduction in response time actually decreases transaction time by 
more than just the response time reduction. (From Brady [1986].) 
I/O benchmark            Response time restriction                  Throughput metric 
TPC-C: Complex           > 90% of transactions must meet            New order transactions 
Query OLTP               response time limit; 5 seconds for         per minute 
                         most types of transactions 
TPC-W: Transactional     > 90% of Web interactions must meet        Web interactions 
Web benchmark            response time limit; 3 seconds for         per second 
                         most types of Web interactions 
SPECsfs97                Average response time < 40 ms              NFS operations 
                                                                    per second 

Figure 6.11 Response time restrictions for three I/O benchmarks. 


processors are 1000 times faster. Examples of long delays include starting an 
application on a desktop PC due to many disk I/Os, or network delays when 
clicking on Web links. 

To reflect the importance of response time to user productivity, I/O bench- 
marks also address the response time versus throughput trade-off. Figure 6.11 
shows the response time bounds for three I/O benchmarks. They report maximum 
throughput given either that 90% of response times must be less than a limit or 
that the average response time must be less than a limit. 
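
The reporting rule just described can be sketched as follows. This is an illustrative reading of the rule, not the procedure of any particular benchmark; the function names, the nearest-rank percentile method, and the measurements are all assumptions.

```python
def percentile_90(samples):
    """Nearest-rank 90th percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # ceiling of 0.9 * n, using integer arithmetic
    return ordered[rank - 1]

def max_qualifying_throughput(runs, limit, use_average=False):
    """runs: list of (throughput, [response times]) pairs, one per load level.

    Returns the highest throughput whose run stays under the response time
    limit, or None if no run qualifies.
    """
    best = None
    for throughput, times in runs:
        measure = sum(times) / len(times) if use_average else percentile_90(times)
        if measure < limit and (best is None or throughput > best):
            best = throughput
    return best

# Hypothetical measurements: (throughput, response times in seconds).
runs = [
    (100, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 9.0]),
    (200, [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 4.9, 12.0]),
    (300, [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20.0]),
]
print(max_qualifying_throughput(runs, limit=5.0))                    # 90th-percentile rule
print(max_qualifying_throughput(runs, limit=2.0, use_average=True))  # average rule
```

The two rules can pick different load levels from the same data, because a percentile bound tolerates a few slow outliers while an average bound does not.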

Let's next look at these benchmarks in more detail. 


Transaction-Processing Benchmarks 


Transaction processing (TP, or OLTP for online transaction processing) is chiefly 
concerned with I/O rate (the number of disk accesses per second), as opposed to 
data rate (measured as bytes of data per second). TP generally involves changes 
to a large body of shared information from many terminals, with the TP system 
guaranteeing proper behavior on a failure. Suppose, for example, a bank's com- 
puter fails when a customer tries to withdraw money from an ATM. The TP sys- 
tem would guarantee that the account is debited if the customer received the 
money and that the account is unchanged if the money was not received. Airline 
reservations systems as well as banks are traditional customers for TP. 

As mentioned in Chapter 1, two dozen members of the TP community con- 
spired to form a benchmark for the industry and, to avoid the wrath of their legal 
departments, published the report anonymously [Anon, et al. 1985]. This report 
led to the Transaction Processing Council, which in turn has led to eight bench- 
marks since its founding. Figure 6.12 summarizes these benchmarks. 

Let's describe TPC-C to give a flavor of these benchmarks. TPC-C uses a 
database to simulate an order-entry environment of a wholesale supplier, includ- 
ing entering and delivering orders, recording payments, checking the status of 
orders, and monitoring the level of stock at the warehouses. It runs five concur- 
rent transactions of varying complexity, and the database includes nine tables 
with a scalable range of records and customers. TPC-C is measured in transac- 
Benchmark                          Data size (GB)          Performance metric             Date of first results 
A: debit credit (retired)          0.1-10                  transactions per second        July 1990 
B: batch debit credit (retired)    0.1-10                  transactions per second        July 1991 
C: complex query OLTP              100-3000                new order transactions         September 1992 
                                   (minimum 0.07 * TPM)    per minute (TPM) 
D: decision support (retired)      100, 300, 1000          queries per hour               December 1995 
H: ad hoc decision support         100, 300, 1000          queries per hour               October 1999 
R: business reporting decision     1000                    queries per hour               August 1999 
   support (retired) 
W: transactional Web benchmark     ≈ 50, 500               Web interactions per second    July 2000 
App: application server and Web    ≈ 2500                  Web service interactions       June 2005 
     services benchmark                                    per second (SIPS) 

Figure 6.12 Transaction Processing Council benchmarks. The summary results include both the performance 
metric and the price-performance of that metric. TPC-A, TPC-B, TPC-D, and TPC-R were retired. 


tions per minute (tpmC) and in price of system, including hardware, software, 
and three years of maintenance support. Figure 1.16 on page 46 in Chapter 1 
describes the top systems in performance and cost-performance for TPC-C. 


These TPC benchmarks were the first, and in some cases still the only ones, that 
have these unusual characteristics: 


Price is included with the benchmark results. The cost of hardware, software, 
and maintenance agreements is included in a submission, which enables evalu- 
ations based on price-performance as well as high performance. 


The data set generally must scale in size as the throughput increases. The 
benchmarks are trying to model real systems, in which the demand on the 
system and the size of the data stored in it increase together. It makes no 
sense, for example, to have thousands of people per minute access hundreds 
of bank accounts. 


The benchmark results are audited. Before results can be submitted, they 
must be approved by a certified TPC auditor, who enforces the TPC rules that 
try to make sure that only fair results are submitted. Results can be chal- 
lenged and disputes resolved by going before the TPC. 


Throughput is the performance metric, but response times are limited. For 
example, with TPC-C, 90% of the New-Order transaction response times 
must be less than 5 seconds. 


An independent organization maintains the benchmarks. Dues collected by 
TPC pay for an administrative structure including a Chief Operating Office. 
This organization settles disputes, conducts mail ballots on approval of 
changes to benchmarks, holds board meetings, and so on. 
SPEC System-Level File Server, Mail, and Web Benchmarks 


The SPEC benchmarking effort is best known for its characterization of proces- 
sor performance, but it has created benchmarks for file servers, mail servers, and 
Web servers. 

Seven companies agreed on a synthetic benchmark, called SFS, to evaluate 
systems running the Sun Microsystems network file service (NFS). This bench- 
mark was upgraded to SFS 3.0 (also called SPEC SFS97_R1) to include support 
for NFS version 3, using TCP in addition to UDP as the transport protocol, and 
making the mix of operations more realistic. Measurements on NFS systems led 
to a synthetic mix of reads, writes, and file operations. SFS supplies default 
parameters for comparative performance. For example, half of all writes are done 
in 8 KB blocks and half are done in partial blocks of 1, 2, or 4 KB. For reads, the 
mix is 85% full blocks and 15% partial blocks. 

Like TPC-C, SFS scales the amount of data stored according to the reported 
throughput: For every 100 NFS operations per second, the capacity must increase 
by 1 GB. It also limits the average response time, in this case to 40 ms. Figure 
6.13 shows average response time versus throughput for two NetApp systems. 
Unfortunately, unlike the TPC benchmarks, SFS does not normalize for different 
price configurations. 
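
The SFS scaling rule quoted above (1 GB of capacity per 100 NFS operations per second) reduces to one line of arithmetic. The sketch below applies it to the two throughput figures reported in Figure 6.13; the function name is hypothetical, and the actual benchmark run rules contain more detail than this.

```python
def sfs_min_capacity_gb(ops_per_second):
    """Capacity implied by the rule of 1 GB per 100 NFS operations per second."""
    return ops_per_second / 100.0

# Throughputs reported for the two NetApp configurations in Figure 6.13.
for ops in (34089, 47927):
    print(ops, "ops/sec needs about", sfs_min_capacity_gb(ops), "GB")
```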

SPECMail is a benchmark to help evaluate performance of mail servers at an 
Internet service provider. SPECMail2001 is based on the standard Internet proto- 
cols SMTP and POP3, and it measures throughput and user response time while 
scaling the number of users from 10,000 to 1,000,000. 

SPECWeb is a benchmark for evaluating the performance of World Wide Web 
servers, measuring number of simultaneous user sessions. The SPECWeb2005 





[Figure 6.13 plot: response time (ms) versus operations/second.] 


Figure 6.13 SPEC SFS97_R1 performance for the NetApp FAS3050c NFS servers in two configurations. Two 
processors reached 34,089 operations per second and four processors did 47,927. Reported in May 2005, these 
systems used the Data ONTAP 7.0.1 R1 operating system, 2.8 GHz Pentium Xeon microprocessors, 2 GB of DRAM per 
processor, 1 GB of nonvolatile memory per system, and 168 15K RPM, 72 GB, fibre channel disks. These disks were 
connected using two or four QLogic ISP-2322 FC disk controllers. 
workload simulates accesses to a Web service provider, where the server supports 
home pages for several organizations. It has three workloads: Banking (HTTPS), 
E-commerce (HTTP and HTTPS), and Support (HTTP). 


Examples of Benchmarks of Dependability 


The TPC-C benchmark does in fact have a dependability requirement. The 
benchmarked system must be able to handle a single disk failure, which means in 
practice that all submitters are running some RAID organization in their storage 
system. 

Efforts that are more recent have focused on the effectiveness of fault toler- 
ance in systems. Brown and Patterson [2000] propose that availability be mea- 
sured by examining the variations in system quality-of-service metrics over time 
as faults are injected into the system. For a Web server the obvious metrics are 
performance (measured as requests satisfied per second) and degree of fault toler- 
ance (measured as the number of faults that can be tolerated by the storage sub- 
system, network connection topology, and so forth). 

The initial experiment injected a single fault, such as a write error in a disk 
sector, and recorded the system's behavior as reflected in the quality-of-service 
metrics. The example compared software RAID implementations provided by 
Linux, Solaris, and Windows 2000 Server. SPECWeb99 was used to provide a 
workload and to measure performance. To inject faults, one of the SCSI disks in 
the software RAID volume was replaced with an emulated disk. It was a PC run- 
ning software using a SCSI controller that appears to other devices on the SCSI 
bus as a disk. The disk emulator allowed the injection of faults. The faults 
injected included a variety of transient disk faults, such as correctable read errors, 
and permanent faults, such as disk media failures on writes. 

Figure 6.14 shows the behavior of each system under different faults. The two 
top graphs show Linux (on the left) and Solaris (on the right). As RAID systems 
can lose data if a second disk fails before reconstruction completes, the longer the 
reconstruction (MTTR), the lower the availability. Faster reconstruction implies 
decreased application performance, however, as reconstruction steals I/O 
resources from running applications. Thus, there is a policy choice between tak- 
ing a performance hit during reconstruction, or lengthening the window of vul- 
nerability and thus lowering the predicted MTTF. 

Although none of the tested systems documented their reconstruction policies 
outside of the source code, even a single fault injection was able to give insight 
into those policies. The experiments revealed that both Linux and Solaris initiate 
automatic reconstruction of the RAID volume onto a hot spare when an active 
disk is taken out of service due to a failure. Although Windows supports RAID 
reconstruction, the reconstruction must be initiated manually. Thus, without 
human intervention, a Windows system that did not rebuild after a first failure 
remains susceptible to a second failure, which increases the window of vulnera- 
bility. It does repair quickly once told to do so. 
Figure 6.14 Availability benchmark for software RAID systems on the same computer running Red Hat 6.0 
Linux, Solaris 7, and Windows 2000 operating systems. Note the difference in philosophy on speed of 
reconstruction of Linux versus Windows and Solaris. The y-axis is behavior in hits per second running SPECWeb99. The arrow 
indicates time of fault insertion. The lines at the top give the 99% confidence interval of performance before the fault 
is inserted. A 99% confidence interval means that if the variable is outside of this range, the probability is only 1% 
that this value would appear. 


The fault injection experiments also provided insight into other availability 
policies of Linux, Solaris, and Windows 2000 concerning automatic spare utilization, 
reconstruction rates, transient errors, and so on. Again, no system documented 
its policies. 


In terms of managing transient faults, the fault injection experiments revealed 
that Linux's software RAID implementation takes the opposite approach from 
the RAID implementations in Solaris and Windows. The Linux implementation 
is paranoid—it would rather shut down a disk in a controlled manner at the first 
error, rather than wait to see if the error is transient. In contrast, Solaris and Win- 
dows are more forgiving—they ignore most transient faults with the expectation 
that they will not recur. Thus, these systems are substantially more robust to 
transients than the Linux system. Note that both Windows and Solaris do log the 
transient faults, ensuring that the errors are reported even if not acted upon. When 
faults were permanent, the systems behaved similarly. 


A Little Queuing Theory 


In processor design, we have simple back-of-the-envelope calculations of perfor- 
mance associated with the CPI formula in Chapter 1, or we can use full scale sim- 
ulation for greater accuracy at greater cost. In I/O systems, we also have a best- 
case analysis as a back-of-the-envelope calculation. Full-scale simulation is also 
much more accurate and much more work to calculate expected performance. 

With I/O systems, however, we also have a mathematical tool to guide I/O 
design that is a little more work and much more accurate than best-case analysis, 
but much less work than full-scale simulation. Because of the probabilistic nature 
of I/O events and because of sharing of I/O resources, we can give a set of simple 
theorems that will help calculate response time and throughput of an entire I/O 
system. This helpful field is called queuing theory. Since there are many books 
and courses on the subject, this section serves only as a first introduction to the 
topic. However, even this small amount can lead to better design of I/O systems. 

Let's start with a black-box approach to I/O systems, as in Figure 6.15. In our 
example, the processor is making I/O requests that arrive at the I/O device, and 
the requests "depart" when the I/O device fulfills them. 

We are usually interested in the long term, or steady state, of a system rather 
than in the initial start-up conditions. Suppose we weren't. Although there is a 
mathematics that helps (Markov chains), except for a few cases, the only way to 
solve the resulting equations is simulation. Since the purpose of this section is to 
show something a little harder than back-of-the-envelope calculations but less 
Figure 6.15 Treating the I/O system as a black box. This leads to a simple but important 
observation: If the system is in steady state, then the number of tasks entering the 
system must equal the number of tasks leaving the system. This flow-balanced state is 
necessary but not sufficient for steady state. If the system has been observed or mea- 
sured for a sufficiently long time and mean waiting times stabilize, then we say that the 
system has reached steady state. 
than simulation, we won't cover such analyses here. (See the references in 
Appendix K for more details.) 

Hence, in this section we make the simplifying assumption that we are evalu- 
ating systems with multiple independent requests for I/O service that are in equi- 
librium: the input rate must be equal to the output rate. We also assume there is a 
steady supply of tasks independent of how long they wait for service. In many 
real systems, such as TPC-C, the task consumption rate is determined by other 
system characteristics, such as memory capacity. 

This leads us to Little's Law, which relates the average number of tasks in 
the system, the average arrival rate of new tasks, and the average time to perform 
a task: 


Mean number of tasks in system = Arrival rate × Mean response time 


Little's Law applies to any system in equilibrium, as long as nothing inside the 
black box is creating new tasks or destroying them. Note that the arrival rate and 
the response time must use the same time unit; inconsistency in time units is a 
common cause of errors. 
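
As a small illustration of the unit caveat, the sketch below applies Little's Law to a hypothetical device whose arrival rate is given per second and whose response time is given in milliseconds. The function name and the numbers are assumptions made for this example.

```python
def mean_tasks_in_system(arrival_rate_per_sec, mean_response_time_sec):
    """Little's Law: mean number of tasks = arrival rate x mean response time.

    Both arguments must be expressed in the same time unit (seconds here).
    """
    return arrival_rate_per_sec * mean_response_time_sec

arrival_rate = 50.0        # tasks per second (hypothetical)
response_time_ms = 20.0    # mean response time in milliseconds (hypothetical)

# Convert milliseconds to seconds before multiplying; skipping this step
# would overstate the result by a factor of 1000.
tasks = mean_tasks_in_system(arrival_rate, response_time_ms / 1000.0)
print(tasks)
```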

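Let's try to derive Little's Law. Assume we observe a system for $\text{Time}_{\text{observe}}$ 
minutes. During that observation, we record how long it took each task to be 
serviced, and then sum those times. The number of tasks completed during 
$\text{Time}_{\text{observe}}$ is $\text{Number}_{\text{tasks}}$, and the sum of the times each task spends in the 
system is $\text{Time}_{\text{accumulated}}$. Note that the tasks can overlap in time, so 
$\text{Time}_{\text{accumulated}} \ge \text{Time}_{\text{observe}}$. Then 

\[ \text{Mean number of tasks in system} = \frac{\text{Time}_{\text{accumulated}}}{\text{Time}_{\text{observe}}} \]

\[ \text{Mean response time} = \frac{\text{Time}_{\text{accumulated}}}{\text{Number}_{\text{tasks}}} \]

\[ \text{Arrival rate} = \frac{\text{Number}_{\text{tasks}}}{\text{Time}_{\text{observe}}} \]

Algebra lets us split the first formula: 

\[ \frac{\text{Time}_{\text{accumulated}}}{\text{Time}_{\text{observe}}} = \frac{\text{Time}_{\text{accumulated}}}{\text{Number}_{\text{tasks}}} \times \frac{\text{Number}_{\text{tasks}}}{\text{Time}_{\text{observe}}} \]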


If we substitute the three definitions above into this formula, and swap the result- 
ing two terms on the right-hand side, we get Little's Law: 


Mean number of tasks in system = Arrival rate × Mean response time 


This simple equation is surprisingly powerful, as we shall see. 
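
The three quantities in the derivation can be computed directly from a log of task arrival and departure times. The following sketch does so for a small hypothetical trace (times in minutes) in which all tasks fall inside the observation window; the function name and the data are illustrative assumptions.

```python
def littles_law_from_trace(tasks, time_observe):
    """tasks: list of (arrival, departure) pairs observed within time_observe."""
    number_tasks = len(tasks)
    # Sum of the time each task spends in the system; overlapping tasks all count.
    time_accumulated = sum(depart - arrive for arrive, depart in tasks)
    mean_tasks_in_system = time_accumulated / time_observe
    mean_response_time = time_accumulated / number_tasks
    arrival_rate = number_tasks / time_observe
    return mean_tasks_in_system, arrival_rate, mean_response_time

# Hypothetical trace: tasks overlap, so accumulated time (15) exceeds the window (10).
trace = [(0.0, 3.0), (1.0, 5.0), (4.0, 8.0), (6.0, 10.0)]
n, rate, resp = littles_law_from_trace(trace, time_observe=10.0)
print(n, rate * resp)  # the two values agree, as Little's Law requires
```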

If we open the black box, we see Figure 6.16. The area where the tasks accu- 
mulate, waiting to be serviced, is called the queue, or waiting line. The device 
performing the requested service is called the server. Until we get to the last two 
pages of this section, we assume a single server. 
Figure 6.16 The single-server model for this section. In this situation, an I/O request 
"departs" by being completed by the server. 


Little's Law and a series of definitions lead to several useful equations: 


• $\text{Time}_{\text{server}}$: Average time to service a task; average service rate is $1/\text{Time}_{\text{server}}$, 
  traditionally represented by the symbol μ in many queuing texts. 

• $\text{Time}_{\text{queue}}$: Average time per task in the queue. 

• $\text{Time}_{\text{system}}$: Average time/task in the system, or the response time, which is 
  the sum of $\text{Time}_{\text{queue}}$ and $\text{Time}_{\text{server}}$. 

• Arrival rate: Average number of arriving tasks/second, traditionally 
  represented by the symbol λ in many queuing texts. 

• $\text{Length}_{\text{server}}$: Average number of tasks in service. 

• $\text{Length}_{\text{queue}}$: Average length of queue. 

• $\text{Length}_{\text{system}}$: Average number of tasks in system, which is the sum of 
  $\text{Length}_{\text{queue}}$ and $\text{Length}_{\text{server}}$. 


One common misunderstanding can be made clearer by these definitions: 
whether the question is how long a task must wait in the queue before service 
starts ($\text{Time}_{\text{queue}}$) or how long a task takes until it is completed ($\text{Time}_{\text{system}}$). The 
latter term is what we mean by response time, and the relationship between the 
terms is $\text{Time}_{\text{system}} = \text{Time}_{\text{queue}} + \text{Time}_{\text{server}}$. 

The mean number of tasks in service ($\text{Length}_{\text{server}}$) is simply Arrival rate × 
$\text{Time}_{\text{server}}$, which is Little's Law. Server utilization is simply the arrival 
rate divided by the service rate. For a single server, the service 
rate is $1/\text{Time}_{\text{server}}$. Hence, server utilization (and, in this case, the mean number 
of tasks per server) is simply 

\[ \text{Server utilization} = \text{Arrival rate} \times \text{Time}_{\text{server}} \]

Server utilization must be between 0 and 1; otherwise, there would be more 
tasks arriving than could be serviced, violating our assumption that the system is 
in equilibrium. Note that this formula is just a restatement of Little's Law. 
Utilization is also called traffic intensity and is represented by the symbol ρ in many 
queuing theory texts. 
Example 

Suppose an I/O system with a single disk gets on average 50 I/O requests per 
second. Assume the average time for a disk to service an I/O request is 10 ms. What 
is the utilization of the I/O system? 

Answer 

Using the equation above, with 10 ms represented as 0.01 seconds, we get: 

\[ \text{Server utilization} = \text{Arrival rate} \times \text{Time}_{\text{server}} = \frac{50}{\text{sec}} \times 0.01\,\text{sec} = 0.50 \]

Therefore, the I/O system utilization is 0.5. 
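
The same calculation can be written as a short function that also enforces the equilibrium condition on utilization. This is a minimal sketch; the function name is hypothetical, and raising an error for values outside the range 0 to 1 is a design choice made for this example.

```python
def server_utilization(arrival_rate_per_sec, time_server_sec):
    """Server utilization = arrival rate x mean service time (single server)."""
    utilization = arrival_rate_per_sec * time_server_sec
    if not 0.0 <= utilization <= 1.0:
        # More work arrives than the server can complete: not in equilibrium.
        raise ValueError("utilization %.2f is outside [0, 1]" % utilization)
    return utilization

print(server_utilization(50.0, 0.01))  # 50 requests per second, 10 ms per request
```

Calling it with 150 requests per second and the same 10 ms service time would raise an error, since the disk could complete at most 100 requests per second.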


How the queue delivers tasks to the server is called the queue discipline. The 
simplest and most common discipline is first in, first out (FIFO). If we assume 
FIFO, we can relate time waiting in the queue to the mean number of tasks in the 
queue: 


\[ \text{Time}_{\text{queue}} = \text{Length}_{\text{queue}} \times \text{Time}_{\text{server}} + \text{Mean time to complete service of task when new task arrives if server is busy} \]


That is, the time in the queue is the number of tasks in the queue times the mean 
service time plus the time it takes the server to complete whatever task is being 
serviced when a new task arrives. (There is one more restriction about the arrival 
of tasks, which we reveal on page 384.) 

The last component of the equation is not as simple as it first appears. A new 
task can arrive at any instant, so we have no basis to know how long the existing 
task has been in the server. Although such requests are random events, if we 
know something about the distribution of events, we can predict performance. 


Poisson Distribution of Random Variables 


To estimate the last component of the formula we need to know a little about distri- 
butions of random variables. A variable is random if it takes one of a specified set 
of values with a specified probability; that is, you cannot know exactly what its next 
value will be, but you may know the probability of all possible values. 

Requests for service from an I/O system can be modeled by a random vari- 
able because the operating system is normally switching between several pro- 
cesses that generate independent I/O requests. We also model I/O service times 
by a random variable given the probabilistic nature of disks in terms of seek and 
rotational delays. 

One way to characterize the distribution of values of a random variable with 
discrete values is a histogram, which divides the range between the minimum and 
maximum values into subranges called buckets. Histograms then plot the number 
in each bucket as columns. 

Histograms work well for distributions that are discrete values—for example, 
the number of I/O requests. For distributions that are not discrete values, such as 
time waiting for an I/O request, we have two choices. Either we need a curve to 
plot the values over the full range, so that we can estimate accurately the value, or 
we need a very fine time unit so that we get a very large number of buckets to 
estimate time accurately. For example, a histogram can be built of disk service 
times measured in intervals of 10 μs although disk service times are truly 
continuous. 

Hence, to be able to solve the last part of the previous equation we need to 
characterize the distribution of this random variable. The mean time and some 
measure of the variance are sufficient for that characterization. 

For the first term, we use the weighted arithmetic mean time. Let's first 
assume that after measuring the number of occurrences, say, $n_i$, of tasks, you 
could compute frequency of occurrence of task $i$: 

\[ f_i = \frac{n_i}{\sum_{j=1}^{n} n_j} \]

Then weighted arithmetic mean is 

\[ \text{Weighted arithmetic mean time} = f_1 \times T_1 + f_2 \times T_2 + \ldots + f_n \times T_n \]

where $T_i$ is the time for task $i$ and $f_i$ is the frequency of occurrence of task $i$. 

To characterize variability about the mean, many people use the standard 
deviation. Let's use the variance instead, which is simply the square of the stan- 
dard deviation, as it will help us with characterizing the probability distribution. 
Given the weighted arithmetic mean, the variance can be calculated as 


\[ \text{Variance} = (f_1 \times T_1^2 + f_2 \times T_2^2 + \ldots + f_n \times T_n^2) - \text{Weighted arithmetic mean time}^2 \]

It is important to remember the units when computing variance. Let's assume the 
distribution is of time. If time is about 100 milliseconds, then squaring it yields 
10,000 square milliseconds. This unit is certainly unusual. It would be more con- 
venient if we had a unitless measure. 

To avoid this unit problem, we use the squared coefficient of variance, 
traditionally called $C^2$: 

\[ C^2 = \frac{\text{Variance}}{\text{Weighted arithmetic mean time}^2} \]

We can solve for $C$, the coefficient of variance, as 

\[ C = \frac{\sqrt{\text{Variance}}}{\text{Weighted arithmetic mean time}} = \frac{\text{Standard deviation}}{\text{Weighted arithmetic mean time}} \]

We are trying to characterize random events, but to be able to predict 
performance we need a distribution of random events where the mathematics is 
tractable. The most popular such distribution is the exponential distribution, which has 
a C value of 1. 
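
The weighted arithmetic mean, the variance, and the unitless squared coefficient of variance can be computed together from a histogram of measured times. The sketch below assumes a hypothetical three-bucket histogram of disk service times; the function name and the data are illustrative only.

```python
def characterize(times, counts):
    """times[i]: time for task type i; counts[i]: how often it occurred.

    Returns (weighted arithmetic mean, variance, squared coefficient of variance).
    """
    total = sum(counts)
    freqs = [n / total for n in counts]                  # frequency of occurrence
    mean = sum(f * t for f, t in zip(freqs, times))
    variance = sum(f * t * t for f, t in zip(freqs, times)) - mean * mean
    c_squared = variance / (mean * mean)                 # unitless
    return mean, variance, c_squared

# Hypothetical histogram: service times in ms and the number of times each was seen.
mean, variance, c_squared = characterize([5.0, 10.0, 15.0], [1, 2, 1])
print(mean, variance, c_squared)
```

Here the variance carries units of square milliseconds, while the squared coefficient of variance stays the same if the times are rescaled to another unit.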

Note that we are using a constant to characterize variability about the mean. 
The invariance of C over time reflects the property that the history of events has 
no impact on the probability of an event occurring now. This forgetful property is 
called memoryless, and this property is an important assumption used to predict 
behavior using these models. (Suppose this memoryless property did not exist; 
then we would have to worry about the exact arrival times of requests relative to 
each other, which would make the mathematics considerably less tractable!) 

One of the most widely used exponential distributions is called a Poisson 
distribution, named after the mathematician Siméon Poisson. It is used to 
characterize random events in a given time interval and has several desirable mathematical 
properties. The Poisson distribution is described by the following equation 
(called the probability mass function): 

\[ \text{Probability}(k) = \frac{e^{-a} \times a^k}{k!} \]

where $a$ = Rate of events × Elapsed time. If interarrival times are exponentially 
distributed and we use arrival rate from above for rate of events, the number of 
arrivals in a time interval $t$ is a Poisson process, which has the Poisson 
distribution with $a$ = Arrival rate × $t$. As mentioned on page 382, the equation for 
$\text{Time}_{\text{queue}}$ has another restriction on task arrival: It holds only for Poisson 
processes. 
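
The probability mass function is easy to evaluate numerically. The sketch below assumes a hypothetical arrival rate of 50 requests per second observed over a 0.1 second interval, so that a = 5; the function name is an assumption for this example.

```python
import math

def poisson_probability(k, rate, elapsed):
    """Probability of exactly k events, with a = rate of events x elapsed time."""
    a = rate * elapsed
    return math.exp(-a) * a ** k / math.factorial(k)

# Hypothetical case: 50 requests per second, 0.1 second window (a = 5).
for k in (0, 5, 10):
    print(k, poisson_probability(k, 50.0, 0.1))

# The probabilities over all k sum to 1; 60 terms is ample for a = 5.
total = sum(poisson_probability(k, 50.0, 0.1) for k in range(60))
```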

Finally, we can answer the question about the length of time a new task must 
wait for the server to complete a task, called the average residual service time, 
which again assumes Poisson arrivals: 

\[ \text{Average residual service time} = 1/2 \times \text{Arithmetic mean} \times (1 + C^2) \]

Although we won't derive this formula, we can appeal to intuition. When the 
distribution is not random and all possible values are equal to the average, the 
standard deviation is 0 and so $C$ is 0. The average residual service time is then just 
half the average service time, as we would expect. If the distribution is random 
and it is Poisson, then $C$ is 1 and the average residual service time equals the 
weighted arithmetic mean time. 
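
The two limiting cases in this intuition can be checked with a short sketch. The function name and the 10 ms mean service time are assumptions made for illustration.

```python
def average_residual_service_time(mean_service_time, c_squared):
    """1/2 x arithmetic mean x (1 + C^2)."""
    return 0.5 * mean_service_time * (1.0 + c_squared)

mean_ms = 10.0  # hypothetical mean service time in ms
print(average_residual_service_time(mean_ms, 0.0))  # constant service times: half the mean
print(average_residual_service_time(mean_ms, 1.0))  # exponential service times: the mean itself
```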


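Example 

Using the definitions and formulas above, derive the average time waiting in the 
queue ($\text{Time}_{\text{queue}}$) in terms of the average service time ($\text{Time}_{\text{server}}$) and server 
utilization. 

Answer 

All tasks in the queue ($\text{Length}_{\text{queue}}$) ahead of the new task must be completed 
before the task can be serviced; each takes on average $\text{Time}_{\text{server}}$. If a task is at 
the server, it takes average residual service time to complete. The chance the 
server is busy is server utilization; hence the expected time for service is Server 
utilization × Average residual service time. This leads to our initial formula: 

\[ \text{Time}_{\text{queue}} = \text{Length}_{\text{queue}} \times \text{Time}_{\text{server}} + \text{Server utilization} \times \text{Average residual service time} \]

Replacing average residual service time by its definition and $\text{Length}_{\text{queue}}$ by 
Arrival rate × $\text{Time}_{\text{queue}}$ yields 

\[ \text{Time}_{\text{queue}} = \text{Server utilization} \times \left(1/2 \times \text{Time}_{\text{server}} \times (1 + C^2)\right) + (\text{Arrival rate} \times \text{Time}_{\text{queue}}) \times \text{Time}_{\text{server}} \]

Since this section is concerned with exponential distributions, $C^2$ is 1. Thus 

\[ \text{Time}_{\text{queue}} = \text{Server utilization} \times \text{Time}_{\text{server}} + (\text{Arrival rate} \times \text{Time}_{\text{queue}}) \times \text{Time}_{\text{server}} \]

Rearranging the last term, let us replace Arrival rate × $\text{Time}_{\text{server}}$ by Server 
utilization: 

\[ \begin{aligned}
\text{Time}_{\text{queue}} &= \text{Server utilization} \times \text{Time}_{\text{server}} + (\text{Arrival rate} \times \text{Time}_{\text{server}}) \times \text{Time}_{\text{queue}} \\
&= \text{Server utilization} \times \text{Time}_{\text{server}} + \text{Server utilization} \times \text{Time}_{\text{queue}}
\end{aligned} \]

Rearranging terms and simplifying gives us the desired equation: 

\[ \begin{aligned}
\text{Time}_{\text{queue}} &= \text{Server utilization} \times \text{Time}_{\text{server}} + \text{Server utilization} \times \text{Time}_{\text{queue}} \\
\text{Time}_{\text{queue}} - \text{Server utilization} \times \text{Time}_{\text{queue}} &= \text{Server utilization} \times \text{Time}_{\text{server}} \\
\text{Time}_{\text{queue}} \times (1 - \text{Server utilization}) &= \text{Server utilization} \times \text{Time}_{\text{server}} \\
\text{Time}_{\text{queue}} &= \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{(1 - \text{Server utilization})}
\end{aligned} \]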

Little’s Law can be applied to the components of the black box as well, since 
they must also be in equilibrium: 

$$\text{Length}_{\text{queue}} = \text{Arrival rate} \times \text{Time}_{\text{queue}}$$

If we substitute for $\text{Time}_{\text{queue}}$ from above, we get

$$\text{Length}_{\text{queue}} = \text{Arrival rate} \times \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}}$$

Since Arrival rate × $\text{Time}_{\text{server}}$ = Server utilization, we can
simplify further:

$$\text{Length}_{\text{queue}} = \text{Server utilization} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}} = \frac{\text{Server utilization}^2}{1 - \text{Server utilization}}$$

This relates the number of items in the queue to server utilization.





Example For the system in the example on page 382, which has a server utilization
of 0.5, what is the mean number of I/O requests in the queue?


Answer Using the equation above,

$$\text{Length}_{\text{queue}} = \frac{\text{Server utilization}^2}{1 - \text{Server utilization}} = \frac{0.5^2}{1 - 0.5} = \frac{0.25}{0.50} = 0.5$$

Therefore, there are 0.5 requests on average in the queue.
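
The two M/M/1 results above can be cross-checked numerically. The following
Python sketch is only an illustration: the function names and the workload
numbers (25 tasks per second, 20 ms service time) are hypothetical, chosen so
that utilization is 0.5. It computes the queue length both from the closed form
and from Little's Law applied to the queue alone.

```python
def mm1_time_queue(time_server, utilization):
    """Average time waiting in queue: Time_server x U / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("equilibrium requires 0 <= utilization < 1")
    return time_server * utilization / (1.0 - utilization)


def mm1_length_queue(utilization):
    """Average number of tasks waiting in queue: U^2 / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("equilibrium requires 0 <= utilization < 1")
    return utilization ** 2 / (1.0 - utilization)


# Hypothetical workload chosen so that utilization is 0.5.
arrival_rate = 25.0    # tasks per second
time_server = 0.020    # seconds per task
utilization = arrival_rate * time_server

length_direct = mm1_length_queue(utilization)
# Little's Law on the queue alone: Length = Arrival rate x Time_queue.
length_little = arrival_rate * mm1_time_queue(time_server, utilization)
print(length_direct, length_little)
```

Both routes should agree up to floating-point rounding, since the closed form
was derived from Little's Law in the first place.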
As mentioned earlier, these equations and this section are based on an area of
applied mathematics called queuing theory, which offers equations to predict 
behavior of such random variables. Real systems are too complex for queuing 
theory to provide exact analysis, and hence queuing theory works best when only 
approximate answers are needed. 

Queuing theory makes a sharp distinction between past events, which can be 
characterized by measurements using simple arithmetic, and future events, which 
are predictions requiring more sophisticated mathematics. In computer systems, 
we commonly predict the future from the past; one example is least-recently used 
block replacement (see Chapter 5). Hence, the distinction between measurements 
and predicted distributions is often blurred; we use measurements to verify the 
type of distribution and then rely on the distribution thereafter. 

Let's review the assumptions about the queuing model: 


• The system is in equilibrium.

• The times between two successive requests arriving, called the interarrival
times, are exponentially distributed, which characterizes the arrival rate
mentioned earlier.

• The number of sources of requests is unlimited. (This is called an infinite
population model in queuing theory; finite population models are used when
arrival rates vary with the number of jobs already in the system.)

• The server can start on the next job immediately after finishing the prior one.

• There is no limit to the length of the queue, and it follows the first in, first out
order discipline, so all tasks in line must be completed.

• There is one server.

Such a queue is called M/M/1:

M = exponentially random request arrival ($C^2 = 1$), with M standing for A. A.
Markov, the mathematician who defined and analyzed the memoryless
processes mentioned earlier

M = exponentially random service time ($C^2 = 1$), with M again for Markov

1 = single server

The M/M/1 model is a simple and widely used model.


The assumption of exponential distribution is commonly used in queuing 
examples for three reasons—one good, one fair, and one bad. The good reason is 
that a superposition of many arbitrary distributions acts as an exponential distri- 
bution. Many times in computer systems, a particular behavior is the result of 
many components interacting, so an exponential distribution of interarrival times 
is the right model. The fair reason is that when variability is unclear, an
exponential distribution with intermediate variability (C = 1) is a safer guess
than low variability (C ≈ 0) or high variability (large C). The bad reason is
that the math is simpler if you assume exponential distributions.
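
One way to build some confidence in the M/M/1 formula under these assumptions
is a small simulation. The sketch below is not from the text: the function
name, the seed, and the workload (25 arrivals per second, 20 ms mean service
time, so utilization is 0.5) are all illustrative assumptions. It draws
exponential interarrival and service times, serves tasks first in, first out on
one server, and compares the measured mean wait with
Time_server × Utilization / (1 − Utilization).

```python
import random


def simulate_mm1_wait(arrival_rate, time_server, n_tasks, seed=1):
    """Simulate a FIFO single-server queue; return the mean time spent waiting."""
    rng = random.Random(seed)
    arrival = 0.0          # arrival time of the current task
    server_free_at = 0.0   # time at which the server finishes its current work
    total_wait = 0.0
    for _ in range(n_tasks):
        arrival += rng.expovariate(arrival_rate)       # exponential interarrival time
        start = max(arrival, server_free_at)           # wait only if the server is busy
        total_wait += start - arrival
        # expovariate takes a rate, so the mean service time is time_server.
        server_free_at = start + rng.expovariate(1.0 / time_server)
    return total_wait / n_tasks


arrival_rate = 25.0    # tasks per second (hypothetical)
time_server = 0.020    # seconds per task (hypothetical)
utilization = arrival_rate * time_server
predicted = time_server * utilization / (1.0 - utilization)
simulated = simulate_mm1_wait(arrival_rate, time_server, n_tasks=200_000)
print(f"predicted {predicted * 1000:.2f} ms, simulated {simulated * 1000:.2f} ms")
```

The simulated value should land near the prediction but not on it. The
simulation starts with an empty queue and uses a finite sample, which is one
reason such models are best treated as approximations.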
Let's put queuing theory to work in a few examples.


Example Suppose a processor sends 40 disk I/Os per second, these requests are exponen- 
tially distributed, and the average service time of an older disk is 20 ms. Answer 
the following questions: 

1. On average, how utilized is the disk? 
2. What is the average time spent in the queue? 


3. What is the average response time for a disk request, including the queuing 
time and disk service time? 


Answer Let's restate these facts:

Average number of arriving tasks/second is 40.

Average disk time to service a task is 20 ms (0.02 sec).

The server utilization is then

$$\text{Server utilization} = \text{Arrival rate} \times \text{Time}_{\text{server}} = 40 \times 0.02 = 0.8$$

Since the service times are exponentially distributed, we can use the simplified
formula for the average time spent waiting in line:

$$\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}} = 20\ \text{ms} \times \frac{0.8}{1 - 0.8} = 20 \times \frac{0.8}{0.2} = 20 \times 4 = 80\ \text{ms}$$

The average response time is

$$\text{Time}_{\text{system}} = \text{Time}_{\text{queue}} + \text{Time}_{\text{server}} = 80 + 20\ \text{ms} = 100\ \text{ms}$$

Thus, on average we spend 80% of our time waiting in the queue!


Example Suppose we get a new, faster disk. Recalculate the answers to the questions 
above, assuming the disk service time is 10 ms. 


Answer The disk utilization is then

$$\text{Server utilization} = \text{Arrival rate} \times \text{Time}_{\text{server}} = 40 \times 0.01 = 0.4$$

The formula for the average time spent waiting in line:

$$\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Server utilization}}{1 - \text{Server utilization}} = 10\ \text{ms} \times \frac{0.4}{1 - 0.4} = 10 \times \frac{0.4}{0.6} = 10 \times \frac{2}{3} = 6.7\ \text{ms}$$

The average response time is 10 + 6.7 ms or 16.7 ms, 6.0 times faster than the
old response time even though the new service time is only 2.0 times faster.
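
The two disk examples can be replayed in a few lines of Python. This is a
sketch only; the helper name `mm1_disk` is hypothetical. It shows why a 2×
improvement in service time gives roughly a 6× improvement in response time:
the faster disk also lowers utilization, which shrinks the queuing term.

```python
def mm1_disk(arrival_rate, time_server):
    """Return (utilization, time in queue, response time) for an M/M/1 disk."""
    utilization = arrival_rate * time_server
    if utilization >= 1.0:
        raise ValueError("arrival rate exceeds what one server can sustain")
    time_queue = time_server * utilization / (1.0 - utilization)
    return utilization, time_queue, time_queue + time_server


old = mm1_disk(40.0, 0.020)   # older disk: 20 ms service time
new = mm1_disk(40.0, 0.010)   # faster disk: 10 ms service time

for name, (u, tq, tr) in (("old", old), ("new", new)):
    print(f"{name}: utilization={u:.1f}, queue={tq * 1000:.1f} ms, "
          f"response={tr * 1000:.1f} ms")
print(f"response-time speedup: {old[2] / new[2]:.1f}x")
```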
Figure 6.17 The M/M/m multiple-server model.


Thus far, we have been assuming a single server, such as a single disk. Many 
real systems have multiple disks and hence could use multiple servers, as in Fig- 
ure 6.17. Such a system is called an M/M/m model in queuing theory. 

Let's give the same formulas for the M/M/m queue, using $N_{\text{servers}}$ to
represent the number of servers. The first two formulas are easy:

$$\text{Utilization} = \frac{\text{Arrival rate} \times \text{Time}_{\text{server}}}{N_{\text{servers}}}$$

$$\text{Length}_{\text{queue}} = \text{Arrival rate} \times \text{Time}_{\text{queue}}$$

The time waiting in the queue is

$$\text{Time}_{\text{queue}} = \text{Time}_{\text{server}} \times \frac{\text{Prob}_{\text{tasks} \ge N_{\text{servers}}}}{N_{\text{servers}} \times (1 - \text{Utilization})}$$

This formula is related to the one for M/M/1, except we replace utilization of
a single server with the probability that a task will be queued as opposed to being
immediately serviced, and divide the time in queue by the number of servers.
Alas, calculating the probability of jobs being in the queue is much more
complicated when there are $N_{\text{servers}}$ servers. First, the probability
that there are no tasks in the system is

$$\text{Prob}_{0\ \text{tasks}} = \left[ 1 + \frac{(N_{\text{servers}} \times \text{Utilization})^{N_{\text{servers}}}}{N_{\text{servers}}! \times (1 - \text{Utilization})} + \sum_{n=1}^{N_{\text{servers}} - 1} \frac{(N_{\text{servers}} \times \text{Utilization})^{n}}{n!} \right]^{-1}$$

Then the probability there are as many or more tasks than we have servers is

$$\text{Prob}_{\text{tasks} \ge N_{\text{servers}}} = \frac{(N_{\text{servers}} \times \text{Utilization})^{N_{\text{servers}}}}{N_{\text{servers}}! \times (1 - \text{Utilization})} \times \text{Prob}_{0\ \text{tasks}}$$

Note that if $N_{\text{servers}}$ is 1, $\text{Prob}_{\text{tasks} \ge N_{\text{servers}}}$
simplifies back to Utilization, and we get the same formula as for M/M/1. Let's
try an example.


Example Suppose instead of a new, faster disk, we add a second slow disk and duplicate 
the data so that reads can be serviced by either disk. Let's assume that the 
requests are all reads. Recalculate the answers to the earlier questions, this time 
using an M/M/m queue. 


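Answer The average utilization of the two disks is then

$$\text{Server utilization} = \frac{\text{Arrival rate} \times \text{Time}_{\text{server}}}{N_{\text{servers}}} = \frac{40 \times 0.02}{2} = 0.4$$

We first calculate the probability of no tasks in the system:

$$\begin{aligned}
\text{Prob}_{0\ \text{tasks}} &= \left[ 1 + \frac{(2 \times \text{Utilization})^{2}}{2! \times (1 - \text{Utilization})} + \sum_{n=1}^{1} \frac{(2 \times \text{Utilization})^{n}}{n!} \right]^{-1} \\
&= \left[ 1 + \frac{(2 \times 0.4)^{2}}{2 \times (1 - 0.4)} + (2 \times 0.4) \right]^{-1} = \left[ 1 + \frac{0.640}{1.2} + 0.800 \right]^{-1} \\
&= [1 + 0.533 + 0.800]^{-1} = 2.333^{-1}
\end{aligned}$$

We use this result to calculate the probability of tasks in the queue:

$$\begin{aligned}
\text{Prob}_{\text{tasks} \ge N_{\text{servers}}} &= \frac{(2 \times \text{Utilization})^{2}}{2! \times (1 - \text{Utilization})} \times \text{Prob}_{0\ \text{tasks}} \\
&= \frac{(2 \times 0.4)^{2}}{2 \times (1 - 0.4)} \times 2.333^{-1} = \frac{0.640}{1.2} \times 2.333^{-1} \\
&= 0.533 / 2.333 = 0.229
\end{aligned}$$

Finally, the time waiting in the queue:

$$\begin{aligned}
\text{Time}_{\text{queue}} &= \text{Time}_{\text{server}} \times \frac{\text{Prob}_{\text{tasks} \ge N_{\text{servers}}}}{N_{\text{servers}} \times (1 - \text{Utilization})} \\
&= 0.020 \times \frac{0.229}{2 \times (1 - 0.4)} = 0.020 \times \frac{0.229}{1.2} \\
&= 0.020 \times 0.190 = 0.0038\ \text{sec}
\end{aligned}$$

The average response time is 20 + 3.8 ms or 23.8 ms. For this workload, two
disks cut the queue waiting time by a factor of 21 over a single slow disk and a
factor of 1.75 versus a single fast disk. The mean response time of a system with a
single fast disk, however, is still 1.4 times faster than one with two disks since the
disk service time is 2.0 times faster.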
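
The M/M/m formulas are easy to get wrong by hand, so here is a minimal Python
sketch of them. The function name `mmm_time_queue` and its return layout are
assumptions made for illustration. It follows the three formulas above
directly: the probability of no tasks in the system, the probability that a task
must queue, and the time waiting in the queue.

```python
from math import factorial


def mmm_time_queue(arrival_rate, time_server, n_servers):
    """Return (utilization, probability a task queues, time in queue) for M/M/m."""
    utilization = arrival_rate * time_server / n_servers
    if utilization >= 1.0:
        raise ValueError("arrival rate exceeds what the servers can sustain")
    load = n_servers * utilization                     # N_servers x Utilization
    busy_term = load ** n_servers / (factorial(n_servers) * (1.0 - utilization))
    # Probability that there are no tasks in the system.
    prob_0_tasks = 1.0 / (
        1.0 + busy_term + sum(load ** n / factorial(n) for n in range(1, n_servers))
    )
    # Probability that there are at least as many tasks as servers.
    prob_queued = busy_term * prob_0_tasks
    time_queue = time_server * prob_queued / (n_servers * (1.0 - utilization))
    return utilization, prob_queued, time_queue


u, p, tq = mmm_time_queue(arrival_rate=40.0, time_server=0.020, n_servers=2)
print(f"utilization={u:.1f}, prob queued={p:.3f}, queue={tq * 1000:.1f} ms")
```

With `n_servers=1` the sum is empty and the function reduces to the M/M/1
formula, which is a convenient sanity check.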
It would be wonderful if we could generalize the M/M/m model to multiple
queues and multiple servers, as this step is much more realistic. Alas, these
models are very hard to solve and to use, and so we won't cover them here.


6.6 Crosscutting Issues


Point-to-Point Links and Switches Replacing Buses 


Point-to-point links and switches are increasing in popularity as Moore's Law 
continues to reduce the cost of components. Combined with the higher I/O band- 
width demands from faster processors, faster disks, and faster local area net- 
works, the decreasing cost advantage of buses means the days of buses in desktop 
and server computers are numbered. This trend started in high-performance
computers in the last edition of the book, and by 2006 has spread throughout
storage. Figure 6.18 shows the old bus-based standards and their replacements.

The number of bits and bandwidth for the new generation is per direction, so 
they double for both directions. Since these new designs use many fewer wires, a 
common way to increase bandwidth is to offer versions with several times the num- 
ber of wires and bandwidth. 


Block Servers versus Filers 


Thus far, we have largely ignored the role of the operating system in storage. In a 
manner analogous to the way compilers use an instruction set, operating systems 
determine what I/O techniques implemented by the hardware will actually be 
used. The operating system typically provides the file abstraction on top of 
blocks stored on the disk. The terms logical units, logical volumes, and physical 
volumes are related terms used in Microsoft and UNIX systems to refer to subset 
collections of disk blocks. 

Standard             Width (bits)  Length (meters)  Clock rate    MB/sec  Max I/O devices
(Parallel) ATA       8             0.5              133 MHz       133     2
Serial ATA           2             2                3 GHz         300     ?
SCSI                 16            12               80 MHz (DDR)  320     15
Serial Attach SCSI   1             10                             375     16,256
PCI                  32/64         0.5              33/66 MHz     533     2
PCI Express          2             0.5              3 GHz         250     ?

Figure 6.18 Parallel I/O buses and their point-to-point replacements. Note the
bandwidth and wires are per direction, so bandwidth doubles when sending both
directions.

A logical unit is the element of storage exported from a disk array, usually
constructed from a subset of the array's disks. A logical unit appears to the server
as a single virtual "disk." In a RAID disk array, the logical unit is configured as a
particular RAID layout, such as RAID 5. A physical volume is the device file
used by the file system to access a logical unit. A logical volume provides a level
of virtualization that enables the file system to split the physical volume across
multiple pieces or to stripe data across multiple physical volumes. A logical unit
is an abstraction of a disk array that presents a virtual disk to the operating
system, while physical and logical volumes are abstractions used by the operating
system to divide these virtual disks into smaller, independent file systems.

Having covered some of the terms for collections of blocks, the question 
arises, Where should the file illusion be maintained: in the server or at the other 
end of the storage area network? 

The traditional answer is the server. It accesses storage as disk blocks and 
maintains the metadata. Most file systems use a file cache, so the server must 
maintain consistency of file accesses. The disks may be direct attached—found 
inside a server connected to an I/O bus—or attached over a storage area network, 
but the server transmits data blocks to the storage subsystem. 

The alternative answer is that the disk subsystem itself maintains the file 
abstraction, and the server uses a file system protocol to communicate with storage. 
Example protocols are Network File System (NFS) for UNIX systems and Com- 
mon Internet File System (CIFS) for Windows systems. Such devices are called 
network attached storage (NAS) devices since it makes no sense for storage to be 
directly attached to the server. The name is something of a misnomer because a 
storage area network like FC-AL can also be used to connect to block servers. The 
term filer is often used for NAS devices that only provide file service and file
storage. Network Appliance was one of the first companies to make filers.

The driving force behind placing storage on the network is to make it easier 
for many computers to share information and for operators to maintain the shared 
system. 


Asynchronous I/O and Operating Systems 


Disks typically spend much more time in mechanical delays than in transferring 
data. Thus, a natural path to higher I/O performance is parallelism, trying to get 
many disks to simultaneously access data for a program. 

The straightforward approach to I/O is to request data and then start using it. 
The operating system then switches to another process until the desired data 
arrive, and then the operating system switches back to the requesting process. 
Such a style is called synchronous I/O—the process waits until the data have 
been read from disk. 

The alternative model is for the process to continue after making a request, 
and it is not blocked until it tries to read the requested data. Such asynchronous 
I/O allows the process to continue making requests so that many I/O requests 
can be operating simultaneously. Asynchronous I/O shares the same philosophy 
as caches in out-of-order CPUs, which achieve greater bandwidth by having 
multiple outstanding events. 
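
The difference between the two styles can be sketched at user level with
Python's `asyncio`. This is an analogy, not the operating system mechanism the
text describes: `read_block` is a hypothetical stand-in whose sleep models the
mechanical delay of a disk, and the `stats` dictionary exists only to count how
many requests are outstanding at once.

```python
import asyncio


async def read_block(block_id, latency, stats):
    """Stand-in for a disk read; the sleep models mechanical delay."""
    stats["in_flight"] += 1
    stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    await asyncio.sleep(latency)
    stats["in_flight"] -= 1
    return f"data-{block_id}"


async def main():
    stats = {"in_flight": 0, "max_in_flight": 0}
    # Issue every request before waiting on any of them.
    requests = [asyncio.create_task(read_block(i, 0.01, stats)) for i in range(4)]
    # ... the process could do other useful work here ...
    results = await asyncio.gather(*requests)   # block only when the data is needed
    return results, stats["max_in_flight"]


results, max_in_flight = asyncio.run(main())
print(results, max_in_flight)
```

Because all four requests are issued before the first wait, they overlap. A
synchronous version would await each `read_block` in turn and never have more
than one request outstanding.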
6.7 Designing and Evaluating an I/O System—The Internet Archive Cluster


The art of I/O system design is to find a design that meets goals for cost, depend- 
ability, and variety of devices while avoiding bottlenecks in I/O performance and 
dependability. Avoiding bottlenecks means that components must be balanced 
between main memory and the I/O device, because performance and dependabil- 
ity—and hence effective cost-performance or cost-dependability—can only be as 
good as the weakest link in the I/O chain. The architect must also plan for expan- 
sion so that customers can tailor the I/O to their applications. This expansibility, 
both in numbers and types of I/O devices, has its costs in longer I/O buses and 
networks, larger power supplies to support I/O devices, and larger cabinets. 

In designing an I/O system, we analyze performance, cost, capacity, and 
availability using varying I/O connection schemes and different numbers of I/O 
devices of each type. Here is one series of steps to follow in designing an I/O sys- 
tem. The answers for each step may be dictated by market requirements or sim- 
ply by cost, performance, and availability goals. 


1. List the different types of I/O devices to be connected to the machine, or list 
the standard buses and networks that the machine will support. 


2. List the physical requirements for each I/O device. Requirements include size, 
power, connectors, bus slots, expansion cabinets, and so on. 


3. List the cost of each I/O device, including the portion of cost of any controller 
needed for this device. 


4. List the reliability of each I/O device. 


5. Record the processor resource demands of each I/O device. This list should 
include 


• Clock cycles for instructions used to initiate an I/O, to support operation
of an I/O device (such as handling interrupts), and to complete I/O

• Processor clock stalls due to waiting for I/O to finish using the memory,
bus, or cache

• Processor clock cycles to recover from an I/O activity, such as a cache
flush


6. List the memory and I/O bus resource demands of each I/O device. Even when 
the processor is not using memory, the bandwidth of main memory and the I/O 
connection is limited. 


7. The final step is assessing the performance and availability of the different
ways to organize these I/O devices. When you can afford it, try to avoid single
points of failure. Performance can only be properly evaluated with simulation,
although it may be estimated using queuing theory. Reliability can be
calculated assuming I/O devices fail independently and that the times to failure
are exponentially distributed. Availability can be computed from reliability by
estimating MTTF for the devices, taking into account the time from failure to
repair.


Given your cost, performance, and availability goals, you then select the best
organization.

Cost-performance goals affect the selection of the I/O scheme and physical 
design. Performance can be measured either as megabytes per second or I/Os per 
second, depending on the needs of the application. For high performance, the 
only limits should be speed of I/O devices, number of I/O devices, and speed of 
memory and processor. For low cost, most of the cost should be the I/O devices 
themselves. Availability goals depend in part on the cost of unavailability to an 
organization. 

Rather than create a paper design, let's evaluate a real system. 


The Internet Archive Cluster 


To make these ideas clearer, we'll estimate the cost, performance, and availability 
of a large storage-oriented cluster at the Internet Archive. The Internet Archive 
began in 1996 with the goal of making a historical record of the Internet as it 
changed over time. You can use the Wayback Machine interface to the Internet 
Archive to perform time travel to see what the Web site at a URL looked like 
some time in the past. In 2006 it contains over a petabyte ($10^{15}$ bytes) and is
growing by 20 terabytes ($10^{12}$ bytes each) of new data per month, so expansible
storage is a requirement. In addition to storing the historical record, the same
hardware is used to crawl the Web every few months to get snapshots of the Internet.
Clusters of computers connected by local area networks have become a very
economical computation engine that works well for some applications. Clusters
also play an important role in Internet services such as the Google search engine,
where the focus is more on storage than it is on computation, as is the case here.
Although it has used a variety of hardware over the years, the Internet 
Archive is moving to a new cluster to become more efficient in power and in floor 
space. The basic building block is a 1U storage node called the PetaBox GB2000 
from Capricorn Technologies. In 2006 it contains four 500 GB Parallel ATA 
(PATA) disk drives, 512 MB of DDR266 DRAM, one 10/100/1000 Ethernet 
interface, and a 1 GHz C3 Processor from VIA, which executes the 80x86 
instruction set. This node dissipates about 80 watts in typical configurations. 
Figure 6.19 shows the cluster in a standard VME rack. Forty of the GB2000s 
fit in a standard VME rack, which gives the rack 80 TB of raw capacity. The 40 
nodes are connected together with a 48-port 10/100 or 10/100/1000 switch, and
the rack dissipates about 3 kW. The limit is usually 10 kW per rack in computer
facilities, so it is well within the guidelines.
A petabyte needs 12 of these racks, connected by a higher-level switch that 
connects the Gbit links coming from the switches in each of the racks. 
Figure 6.19 The TB-80 VME rack from Capricorn Systems used by the Internet
Archive. All cables, switches, and displays are accessible from the front side, and so the 
back side is only used for airflow. This allows two racks to be placed back-to-back, which 
reduces the floor space demands in machine rooms. 


Estimating Performance, Dependability, and Cost of the 
Internet Archive Cluster 


To illustrate how to evaluate an I/O system, we'll make some guesses about the 
cost, performance, and reliability of the components of this cluster. We make the 
following assumptions about cost and performance: 


• The VIA processor, 512 MB of DDR266 DRAM, ATA disk controller, power
supply, fans, and enclosure cost $500.

• Each of the four 7200 RPM Parallel ATA drives holds 500 GB, has an average
seek time of 8.5 ms, transfers at 50 MB/sec from the disk, and costs $375.

• The PATA link speed is 133 MB/sec.

• The 48-port 10/100/1000 Ethernet switch and all cables for a rack cost
$3000.

• The performance of the VIA processor is 1000 MIPS.

• The ATA controller adds 0.1 ms of overhead to perform a disk I/O.

• The operating system uses 50,000 CPU instructions for a disk I/O.

• The network protocol stacks use 100,000 CPU instructions to transmit a data
block between the cluster and the external world.

• The average I/O size is 16 KB for accesses to the historical record via the
Wayback interface, and 50 KB when collecting a new snapshot.


Example Evaluate the cost per I/O per second (IOPS) of the 80 TB rack. Assume that every 
disk I/O requires an average seek and average rotational delay. Assume the work- 
load is evenly divided among all disks and that all devices can be used at 100% of 
capacity; that is, the system is limited only by the weakest link, and it can operate 
that link at 100% utilization. Calculate for both average I/O sizes. 


Answer I/O performance is limited by the weakest link in the chain, so we evaluate the 
maximum performance of each link in the I/O chain for each organization to 
determine the maximum performance of that organization. 

Let's start by calculating the maximum number of IOPS for the CPU, main 
memory, and I/O bus of one GB2000. The CPU I/O performance is determined 
by the speed of the CPU and the number of instructions to perform a disk I/O and 
to send it over the network: 

$$\text{Maximum IOPS for CPU} = \frac{1000\ \text{MIPS}}{50{,}000\ \text{instructions per I/O} + 100{,}000\ \text{instructions per message}} = 6667\ \text{IOPS}$$





The maximum performance of the memory system is determined by the memory 
bandwidth and the size of the I/O transfers: 


$$\text{Maximum IOPS for main memory} = \frac{266 \times 8}{16\ \text{KB per I/O}} = 133{,}000\ \text{IOPS}$$

$$\text{Maximum IOPS for main memory} = \frac{266 \times 8}{50\ \text{KB per I/O}} = 42{,}500\ \text{IOPS}$$

The Parallel ATA link performance is limited by the bandwidth and the size of the
I/O:

$$\text{Maximum IOPS for the I/O bus} = \frac{133\ \text{MB/sec}}{16\ \text{KB per I/O}} = 8300\ \text{IOPS}$$

$$\text{Maximum IOPS for the I/O bus} = \frac{133\ \text{MB/sec}}{50\ \text{KB per I/O}} = 2700\ \text{IOPS}$$

Since the box has two buses, the I/O bus limits the maximum performance to no
more than 16,600 IOPS for 16 KB blocks and 5400 IOPS for 50 KB blocks.
Now it's time to look at the performance of the next link in the I/O chain, the 
ATA controllers. The time to transfer a block over the PATA channel is 

$$\text{Parallel ATA transfer time} = \frac{16\ \text{KB}}{133\ \text{MB/sec}} = 0.1\ \text{ms}$$

$$\text{Parallel ATA transfer time} = \frac{50\ \text{KB}}{133\ \text{MB/sec}} = 0.4\ \text{ms}$$

Adding the 0.1 ms ATA controller overhead means 0.2 ms to 0.5 ms per I/O,
making the maximum rate per controller

$$\text{Maximum IOPS per ATA controller} = \frac{1}{0.2\ \text{ms}} = 5000\ \text{IOPS}$$

$$\text{Maximum IOPS per ATA controller} = \frac{1}{0.5\ \text{ms}} = 2000\ \text{IOPS}$$


The next link in the chain is the disks themselves. The time for an average 
disk I/O is 


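$$\text{I/O time} = 8.5\ \text{ms} + \frac{0.5}{7200\ \text{RPM}} + \frac{16\ \text{KB}}{50\ \text{MB/sec}} = 8.5 + 4.2 + 0.3 = 13.0\ \text{ms}$$

$$\text{I/O time} = 8.5\ \text{ms} + \frac{0.5}{7200\ \text{RPM}} + \frac{50\ \text{KB}}{50\ \text{MB/sec}} = 8.5 + 4.2 + 1.0 = 13.7\ \text{ms}$$

Therefore, disk performance is

$$\text{Maximum IOPS (using average seeks) per disk} = \frac{1}{13.0\ \text{ms}} = 77\ \text{IOPS}$$

$$\text{Maximum IOPS (using average seeks) per disk} = \frac{1}{13.7\ \text{ms}} = 73\ \text{IOPS}$$

or 292-308 IOPS for the four disks.
The final link in the chain is the network that connects the computers to the
outside world. The link speed determines the limit:

$$\text{Maximum IOPS per 1000 Mbit Ethernet link} = \frac{1000\ \text{Mbit}}{16\text{K} \times 8} = 7812\ \text{IOPS}$$

$$\text{Maximum IOPS per 1000 Mbit Ethernet link} = \frac{1000\ \text{Mbit}}{50\text{K} \times 8} = 2500\ \text{IOPS}$$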


Clearly, the performance bottleneck of the GB2000 is the disks. The IOPS for 
the whole rack is 40 x 308 or 12,320 IOPS to 40 x 292 or 11,680 IOPS. The net- 
work switch would be the bottleneck if it couldn't support 12,320 x 16K x 8 or 
1.6 Gbits/sec for 16 KB blocks and 11,680 x 50K x 8 or 4.7 Gbits/sec for 50 KB 
blocks. We assume that the extra 8 Gbit ports of the 48-port switch connect the
rack to the rest of the world, so it could support the full IOPS of the collective
160 disks in the rack.

Using these assumptions, the cost is 40 x ($500 + 4 x $375) + $3000 + $1500
or $84,500 for an 80 TB rack. The disks themselves are about 70% of the cost.
The cost per terabyte is slightly more than $1000, which is about a factor of 10-15
better than the storage cluster from the prior edition in 2001. The cost per IOPS
is about $7.
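
The weakest-link analysis above lends itself to a short script. The Python
sketch below is an illustration under stated assumptions: the names
`node_limits` and `bottleneck` are hypothetical, units are decimal (1 KB = 1000
bytes) as in the text's arithmetic, and intermediate values are not rounded, so
some per-link figures differ slightly from the hand calculation (for example,
the ATA controller limit).

```python
KB = 1000        # decimal units, matching the text's arithmetic
MB = 1000 * KB


def node_limits(io_bytes):
    """Maximum IOPS of each link in one storage node for a given I/O size."""
    rotation_ms = 0.5 / (7200 / 60) * 1000          # half a revolution at 7200 RPM
    disk_ms = 8.5 + rotation_ms + io_bytes / (50 * MB) * 1000
    controller_ms = 0.1 + io_bytes / (133 * MB) * 1000
    return {
        "cpu": 1000e6 / (50_000 + 100_000),         # 1000 MIPS / instructions per I/O
        "memory": 266 * 8 * MB / io_bytes,          # 266 x 8 MB/sec, as in the text
        "pata_links": 2 * 133 * MB / io_bytes,      # two PATA buses
        "ata_controller": 1000 / controller_ms,     # per controller
        "disks": 4 * 1000 / disk_ms,                # four disks per node
        "ethernet": 1000e6 / (io_bytes * 8),        # one 1000 Mbit link
    }


def bottleneck(io_bytes):
    """Return the name and IOPS of the slowest link in the node."""
    limits = node_limits(io_bytes)
    name = min(limits, key=limits.get)
    return name, limits[name]


RACK_COST = 40 * (500 + 4 * 375) + 3000 + 1500      # dollars, as totalled in the text

for size in (16 * KB, 50 * KB):
    name, iops = bottleneck(size)
    print(f"{size // KB} KB: bottleneck={name}, node IOPS={iops:.0f}, "
          f"cost per IOPS=${RACK_COST / (40 * iops):.2f}")
```

Changing one assumption, such as the seek time or the number of disks per node,
and rerunning shows quickly whether the bottleneck moves to another link.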


Calculating MTTF of the TB-80 Cluster


Internet services like Google rely on many copies of the data at the application 
level to provide dependability, often at different geographic sites to protect
against environmental faults as well as hardware faults. Hence, the Internet
Archive has two copies of the data in each site and has sites in San Francisco, 
Amsterdam, and Alexandria, Egypt. Each site maintains a duplicate copy of the 
high-value content—music, books, film, and video—and a single copy of the his- 
torical Web crawls. To keep costs low, there is no redundancy in the 80 TB rack. 


Example Let's look at the resulting mean time to fail of the rack. Rather than use the man- 
ufacturer's quoted MTTF of 600,000 hours, we'll use data from a recent survey 
of disk drives [Gray and van Ingen 2005]. As mentioned in Chapter 1, about 3% 
to 7% of ATA drives fail per year, or an MTTF of about 125,000-300,000 hours. 
Make the following assumptions, again assuming exponential lifetimes: 

• CPU/memory/enclosure MTTF is 1,000,000 hours.

• PATA Disk MTTF is 125,000 hours.

• PATA controller MTTF is 500,000 hours.

• Ethernet Switch MTTF is 500,000 hours.

• Power supply MTTF is 200,000 hours.

• Fan MTTF is 200,000 hours.

• PATA cable MTTF is 1,000,000 hours.


Answer Collecting these together, we compute these failure rates:

$$\begin{aligned}
\text{Failure rate} &= \frac{40}{1{,}000{,}000} + \frac{160}{125{,}000} + \frac{40}{500{,}000} + \frac{1}{500{,}000} + \frac{40}{200{,}000} + \frac{40}{200{,}000} + \frac{80}{1{,}000{,}000} \\
&= \frac{40 + 1280 + 80 + 2 + 200 + 200 + 80}{1{,}000{,}000\ \text{hours}} = \frac{1882}{1{,}000{,}000\ \text{hours}}
\end{aligned}$$

The MTTF for the system is just the inverse of the failure rate:

$$\text{MTTF} = \frac{1}{\text{Failure rate}} = \frac{1{,}000{,}000\ \text{hours}}{1882} = 531\ \text{hours}$$

That is, given these assumptions about the MTTF of components, something in a
rack fails on average every 3 weeks. About 70% of the failures would be the
disks, and about 20% would be fans or power supplies.
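
The same calculation can be written as a short Python sketch. The component
counts are read from the numerators of the failure-rate sum and the MTTF values
from the assumptions listed in the example. The dictionary layout and variable
names are illustrative only.

```python
# (count in the rack, MTTF in hours) for each component type.
components = {
    "CPU/memory/enclosure": (40, 1_000_000),
    "PATA disk": (160, 125_000),
    "PATA controller": (40, 500_000),
    "Ethernet switch": (1, 500_000),
    "power supply": (40, 200_000),
    "fan": (40, 200_000),
    "PATA cable": (80, 1_000_000),
}

# With exponential lifetimes and independent failures, failure rates add.
rates = {name: count / mttf for name, (count, mttf) in components.items()}
failure_rate = sum(rates.values())          # failures per hour for the whole rack
mttf_hours = 1.0 / failure_rate

disk_share = rates["PATA disk"] / failure_rate
fan_psu_share = (rates["fan"] + rates["power supply"]) / failure_rate

print(f"rack MTTF: {mttf_hours:.0f} hours ({mttf_hours / 24:.0f} days)")
print(f"disks: {disk_share:.0%}, fans and power supplies: {fan_psu_share:.0%}")
```

Keeping the per-component rates in a dictionary makes it easy to see which
component dominates and how much a more reliable disk would help.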


6.8 Putting It All Together: NetApp FAS6000 Filer 


Network Appliance entered the storage market in 1992 with a goal of providing
an easy-to-operate file server running NFS using their own log-structured file
system and a RAID 4 disk array. The company later added support for the
Windows CIFS file system and a RAID 6 scheme called row-diagonal parity or
RAID-DP (see page 364). To support applications that want access to raw data
blocks without the overhead of a file system, such as database systems, NetApp
filers can serve data blocks over a standard Fibre Channel interface. NetApp also 
supports iSCSI, which allows SCSI commands to run over a TCP/IP network, 
thereby allowing the use of standard networking gear to connect servers to stor- 
age, such as Ethernet, and hence greater distance. 

The latest hardware product is the FAS6000. It is a multiprocessor based on 
the AMD Opteron microprocessor connected using its Hypertransport links. The 
microprocessors run the NetApp software stack, including NFS, CIFS, RAID-DP,
SCSI, and so on. The FAS6000 comes as either a dual processor (FAS6030) or a 
quad processor (FAS6070). As mentioned in Chapter 4, DRAM is distributed to 
each microprocessor in the Opteron. The FAS6000 connects 8 GB of DDR2700 
to each Opteron, yielding 16 GB for the FAS6030 and 32 GB for the FAS6070. 
As mentioned in Chapter 5, the DRAM bus is 128 bits wide, plus extra bits for 
SEC/DED memory. Both models dedicate four Hypertransport links to I/O. 

As a filer, the FAS6000 needs a lot of I/O to connect to the disks and to con- 
nect to the servers. The integrated I/O consists of 


• 8 Fibre Channel (FC) controllers and ports,

• 6 Gigabit Ethernet links,

• 6 slots for x8 (2 GB/sec) PCI Express cards,

• 3 slots for PCI-X 133 MHz, 64-bit cards,

• plus standard I/O options like IDE, USB, and 32-bit PCI.


The 8 Fibre Channel controllers can each be attached to 6 shelves containing 14 
3.5-inch FC disks. Thus, the maximum number of drives for the integrated I/O is 
8 x 6x 14 or 672 disks. Additional FC controllers can be added to the option slots 
to connect up to 1008 drives, to reduce the number of drives per FC network so as 
to reduce contention, and so on. At 500 GB per FC drive in 2006, if we assume
the RAID-DP group is 14 data disks and 2 check disks, the available data
capacity is 294 TB for 672 disks and 441 TB for 1008 disks.
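
The capacity figures follow from the drive counts and the RAID group shape. A
minimal Python sketch, with hypothetical function and parameter names and
decimal terabytes (1 TB = 1000 GB):

```python
def usable_tb(n_disks, disk_gb=500, data_disks=14, check_disks=2):
    """Usable capacity in TB when every RAID group is data_disks + check_disks."""
    group = data_disks + check_disks
    return n_disks * disk_gb * data_disks / group / 1000


integrated_disks = 8 * 6 * 14      # FC controllers x shelves x disks per shelf
print(integrated_disks, usable_tb(integrated_disks), usable_tb(1008))
```

Two of every sixteen drives hold check information, so one-eighth of the raw
capacity is given up for redundancy.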

It can also connect to Serial ATA disks via a Fibre Channel to SATA bridge 
controller, which, as its name suggests, allows FC and SATA to communicate. 

The six 1-gigabit Ethernet links connect to servers to make the FAS6000 look
like a file server if running NFS or CIFS, or like a block server if running iSCSI.

For greater dependability, FAS6000 filers can be paired so that if one fails, 
the other can take over. Clustered failover requires that both filers have access to 
all disks in the pair of filers using the FC interconnect. This interconnect also 
allows each filer to have a copy of the log data in the NVRAM of the other filer 
and to keep the clocks of the pair synchronized. The health of the filers is con- 
stantly monitored, and failover happens automatically. The healthy filer main- 
tains its own network identity and its own primary functions, but it also assumes 
the network identity of the failed filer and handles all its data requests via a vir- 
tual filer until an administrator restores the data service to the original state. 
Fallacies and Pitfalls


Fallacy Components fail fast.


A good deal of the fault-tolerant literature is based on the simplifying assumption 
that a component operates perfectly until a latent error becomes effective, and 
then a failure occurs that stops the component. 

The Tertiary Disk project had the opposite experience. Many components 
started acting strangely long before they failed, and it was generally up to the sys- 
tem operator to determine whether to declare a component as failed. The compo- 
nent would generally be willing to continue to act in violation of the service 
agreement until an operator "terminated" that component. 

Figure 6.20 shows the history of four drives that were terminated, and the
number of hours they acted strangely before they were replaced.


Fallacy Computer systems achieve 99.999% availability ("five nines"), as advertised.


Marketing departments of companies making servers started bragging about the 
availability of their computer hardware; in terms of Figure 6.21, they claim avail- 
ability of 99.999%, nicknamed five nines. Even the marketing departments of 
operating system companies tried to give this impression. 

Five minutes of unavailability per year is certainly impressive, but given the 
failure data collected in surveys, it's hard to believe. For example, Hewlett- 
Packard claims that the HP-9000 server hardware and HP-UX operating system 
can deliver a 99.999% availability guarantee "in certain pre-defined, pre-tested 
customer environments" (see Hewlett-Packard [1998]). This guarantee does not 
include failures due to operator faults, application faults, or environmental faults, 





Messages in system log for failed disk                    Number of       Duration
                                                          log messages    (hours)
Hardware Failure (Peripheral device write fault           1763            186
[for] Field Replaceable Unit)
Not Ready (Diagnostic failure: ASCQ = Component ID        1460            90
[of] Field Replaceable Unit)
Recovered Error (Failure Prediction Threshold Exceeded    1313            5
[for] Field Replaceable Unit)
Recovered Error (Failure Prediction Threshold Exceeded    431             17
[for] Field Replaceable Unit)

Figure 6.20 Record in system log for 4 of the 368 disks in Tertiary Disk that were
replaced over 18 months. See Talagala and Patterson [1999]. These messages, matching
the SCSI specification, were placed into the system log by device drivers. Messages
started occurring as much as a week before one drive was replaced by the operator.
The third and fourth messages indicate that the drive's failure prediction mechanism
detected and predicted imminent failure, yet it was still hours before the drives were
replaced by the operator.
Unavailability        Availability    Availability class
(minutes per year)    (percent)       ("number of nines")
50,000                90%             1
5,000                 99%             2
500                   99.9%           3
50                    99.99%          4
5                     99.999%         5
0.5                   99.9999%        6
0.05                  99.99999%       7

Figure 6.21 Minutes unavailable per year to achieve availability class (from Gray
and Siewiorek [1991]). Note that five nines mean unavailable five minutes per year.


which are likely the dominant fault categories today. Nor does it include sched- 
uled downtime. It is also unclear what the financial penalty is to a company if a 
system does not match its guarantee. 

Microsoft also promulgated a five nines marketing campaign. In January 
2001, www.microsoft.com was unavailable for 22 hours. For its Web site to 
achieve 99.999% availability, it will require a clean slate for 250 years. 

In contrast to marketing suggestions, well-managed servers in 2006 typically 
achieve 99% to 99.9% availability. 
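
The arithmetic behind these availability figures can be checked with a short
sketch. The function names below are illustrative and do not come from the text,
and a 365-day year is assumed.

```c
#include <assert.h>

/* Minutes in a non-leap year: 365 days x 24 hours x 60 minutes (assumption). */
#define MINUTES_PER_YEAR (365.0 * 24.0 * 60.0)

/* Minutes of downtime per year allowed by a given availability (0..1). */
double downtime_minutes_per_year(double availability) {
    assert(availability >= 0.0 && availability <= 1.0);
    return (1.0 - availability) * MINUTES_PER_YEAR;
}

/* Years of otherwise uninterrupted service needed so that one outage of
   outage_hours still averages out to the target availability. */
double years_to_amortize_outage(double outage_hours, double availability) {
    double budget = downtime_minutes_per_year(availability);
    assert(budget > 0.0);
    return outage_hours * 60.0 / budget;
}
```

Under these assumptions, five nines allows a little over 5 minutes of downtime
per year, and a single 22-hour outage takes roughly 250 years to amortize, which
is consistent with the figure quoted above.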





Pitfall Where a function is implemented affects its reliability.


In theory, it is fine to move the RAID function into software. In practice, it is very 
difficult to make it work reliably. 

The software culture is generally based on eventual correctness via a series of
releases and patches. Software RAID is also difficult to isolate from other layers
of software. For example, proper software behavior is often based on having the
proper version and patch release of the operating system. Thus, many customers
have lost data in software RAID systems because of software bugs or
incompatibilities in the environment.

Obviously, hardware systems are not immune to bugs, but the hardware cul- 
ture tends to place a greater emphasis on testing correctness in the initial release. 
In addition, the hardware is more likely to be independent of the version of the 
operating system. 


Fallacy Operating systems are the best place to schedule disk accesses. 


Higher-level interfaces like ATA and SCSI offer logical block addresses to the 
host operating system. Given this high-level abstraction, the best an OS can do is 
to try to sort the logical block addresses into increasing order. Since only the disk 
knows the mapping of the logical addresses onto the physical geometry of sec- 
tors, tracks, and surfaces, it can reduce the rotational and seek latencies. 
For example, suppose the workload is four reads [Anderson 2003]:

Operation    Starting LBA    Length
Read         724             8
Read         100             16
Read         9987            1
Read         26              128

The host might reorder the four reads into logical block order:

Operation    Starting LBA    Length
Read         26              128
Read         100             16
Read         724             8
Read         9987            1





Depending on the relative location of the data on the disk, reordering could make 
it worse, as Figure 6.22 shows. The disk-scheduled reads complete in three-quar- 
ters of a disk revolution, but the OS-scheduled reads take three revolutions. 
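
The host-side reordering is simply a sort on the starting logical block address.
A minimal sketch follows. The struct and function names are made up for
illustration, and the sketch makes no attempt to model what the drive does with
its knowledge of the physical geometry.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued read: starting logical block address and length in blocks. */
struct read_req {
    long lba;
    long length;
};

/* qsort comparator: ascending starting LBA. */
static int by_lba(const void *a, const void *b) {
    const struct read_req *x = a;
    const struct read_req *y = b;
    return (x->lba > y->lba) - (x->lba < y->lba);
}

/* Reorder a queue the way a host that sees only LBAs might. */
void host_order(struct read_req *queue, size_t n) {
    assert(queue != NULL || n == 0);
    qsort(queue, n, sizeof queue[0], by_lba);
}
```

Applied to the four reads in the example, this produces the order 26, 100, 724,
9987 shown in the second table. Whether that order is good for the disk depends
on where those blocks actually sit on the platters, which the host cannot see.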


Fallacy The time of an average seek of a disk in a computer system is the time for a seek of 
one-third the number of cylinders. 


This fallacy comes from confusing the way manufacturers market disks with the 
expected performance, and from the false assumption that seek times are linear in 
distance. The one-third-distance rule of thumb comes from calculating the 
distance of a seek from one random location to another random location, not 
including the current track and assuming there are a large number of tracks. In 





Figure 6.22 Example showing OS versus disk schedule accesses, labeled host-
ordered versus drive-ordered. The former takes 3 revolutions to complete the 4 reads,
while the latter completes them in just 3/4 of a revolution. From Anderson [2003].


the past, manufacturers listed the seek of this distance to offer a consistent basis 
for comparison. (Today they calculate the "average" by timing all seeks and 
dividing by the number.) Assuming (incorrectly) that seek time is linear in dis- 
tance, and using the manufacturer's reported minimum and "average" seek times, 
a common technique to predict seek time is 

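\[
\text{Time}_{\text{seek}} = \text{Time}_{\text{minimum}} + \frac{\text{Distance}}{\text{Distance}_{\text{average}}} \times \left(\text{Time}_{\text{average}} - \text{Time}_{\text{minimum}}\right)
\]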
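
For reference, the linear estimate is easy to write down as code. This sketch only
restates the naive technique that the fallacy criticizes. The function and
parameter names are illustrative.

```c
#include <assert.h>

/* Naive (fallacious) estimate: seek time grows linearly with distance,
   pinned to the quoted minimum and "average" seek times. */
double naive_seek_time(double distance, double avg_distance,
                       double t_min, double t_avg) {
    assert(avg_distance > 0.0);
    return t_min + (distance / avg_distance) * (t_avg - t_min);
}
```

By construction the estimate returns the quoted "average" time at the average
distance and keeps growing in a straight line beyond it.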

The fallacy concerning seek time is twofold. First, seek time is not linear with 
distance; the arm must accelerate to overcome inertia, reach its maximum travel- 
ing speed, decelerate as it reaches the requested position, and then wait to allow 
the arm to stop vibrating (settle time). Moreover, sometimes the arm must pause 
to control vibrations. For disks with more than 200 cylinders, Chen and Lee
[1995] modeled the seek time as

\[
\text{Seek time}(\text{Distance}) = a \times \sqrt{\text{Distance} - 1} + b \times (\text{Distance} - 1) + c
\]


where a, b, and c are selected for a particular disk so that this formula will match
the quoted times for Distance = 1, Distance = max, and Distance = 1/3 max.
Figure 6.23 plots this equation versus the fallacy equation. Unlike the first
equation, this model includes a square-root term in the distance, which reflects
acceleration and deceleration.
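
The fitting step can be written out explicitly. The symbols here are assumptions
introduced for this sketch and are not in the original: T_min, T_avg, and T_max
denote the quoted seek times at Distance = 1, Distance = max/3, and Distance = max,
with u_1 = \sqrt{max/3 - 1} and u_2 = \sqrt{max - 1}. Setting Distance = 1 gives c
directly, and the other two quoted times give two linear equations in a and b:

```latex
\begin{aligned}
c &= T_{\min} \\
a\,u_1 + b\,u_1^2 &= T_{\text{avg}} - T_{\min} \\
a\,u_2 + b\,u_2^2 &= T_{\max} - T_{\min} \\
a &= \frac{(T_{\text{avg}} - T_{\min})\,u_2^2 - (T_{\max} - T_{\min})\,u_1^2}{u_1 u_2 (u_2 - u_1)} \\
b &= \frac{(T_{\max} - T_{\min})\,u_1 - (T_{\text{avg}} - T_{\min})\,u_2}{u_1 u_2 (u_2 - u_1)}
\end{aligned}
```

The solution requires u_1 and u_2 to be positive and distinct, which holds for any
disk with more than a handful of cylinders.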

The second problem is that the average in the product specification would 
only be true if there were no locality to disk activity. Fortunately, there is both 
Figure 6.23 Seek time versus seek distance for sophisticated model versus naive
model. Chen and Lee [1995] found that the equations shown above for parameters a,
b, and c worked well for several disks.

Figure 6.24 Sample measurements of seek distances for two systems. The
measurements on the left were taken on a UNIX time-sharing system. The
measurements on the right were taken from a business-processing application in
which the disk seek activity was scheduled to improve throughput. Seek distance of
0 means the access was made to the same cylinder. The rest of the numbers show the
collective percentage for distances between numbers on the y-axis. For example,
11% for the bar labeled 16 in the business graph means that the percentage of
seeks between 1 and 16 cylinders was 11%. The UNIX measurements stopped at 200 of
the 1000 cylinders, but this captured 85% of the accesses. The business
measurements tracked all 816 cylinders of the disks. The only seek distances with
1% or greater of the seeks that are not in the graph are 224 with 4%, and 304, 336,
512, and 624, each having 1%. This total is 94%, with the difference being small
but nonzero distances in other categories. Measurements courtesy of Dave
Anderson of Seagate.


temporal and spatial locality (see page C-2 in Appendix C). For example, 
Figure 6.24 shows sample measurements of seek distances for two workloads: a 
UNIX time-sharing workload and a business-processing workload. Notice the 
high percentage of disk accesses to the same cylinder, labeled distance 0 in the
graphs, in both workloads. Thus, this fallacy couldn't be more misleading. 


6.10 Concluding Remarks


Storage is one of those technologies that we tend to take for granted. And yet, if 
we look at the true status of things today, storage is king. One can even argue that 
servers, which have become commodities, are now becoming peripheral to 
storage devices. Driving that point home are some estimates from IBM, which 
expects storage sales to surpass server sales in the next two years. 


Michael Vizard 


editor in chief, Infoworld, August 11, 2001
As their value is becoming increasingly evident, storage systems have become
the target of innovation and investment. 

The challenge for storage systems today is dependability and maintainability. 
Not only do users want to be sure their data are never lost (reliability), applica- 
tions today increasingly demand that the data are always available to access 
(availability). Despite improvements in hardware and software reliability and 
fault tolerance, the awkwardness of maintaining such systems is a problem both 
for cost and for availability. A widely mentioned statistic is that customers spend 
$6 to $8 operating a storage system for every $1 of purchase price. When depend- 
ability is attacked by having many redundant copies at a higher level of the sys- 
tem—such as for search—then very large systems can be sensitive to the price- 
performance of the storage components. 

Today, challenges in storage dependability and maintainability dominate the 
challenges of I/O. 


Historical Perspective and References 



Section K.7 on the companion CD covers the development of storage devices and 
techniques, including who invented disks, the story behind RAID, and the history 
of operating systems and databases. References for further reading are included. 


Case Studies with Exercises by Andrea C. Arpaci-Dusseau 
and Remzi H. Arpaci-Dusseau 


Case Study 1: Deconstructing a Disk 


Concepts illustrated by this case study 


• Performance Characteristics
• Microbenchmarks


The internals of a storage system tend to be hidden behind a simple interface, that 
of a linear array of blocks. There are many advantages to having a common inter- 
face for all storage systems: an operating system can use any storage system 
without modification, and yet the storage system is free to innovate behind this 
interface. For example, a single disk can map its internal <sector, track, surface> 
geometry to the linear array in whatever way achieves the best performance; sim- 
ilarly, a multidisk RAID system can map the blocks on any number of disks to 
this same linear array. However, this fixed interface has a number of disadvan- 
tages as well; in particular, the operating system is not able to perform some per- 
formance, reliability, and security optimizations without knowing the precise 
layout of its blocks inside the underlying storage system. 
In this case study, we will explore how software can be used to uncover the
internal structure of a storage system hidden behind a block-based interface. The 
basic idea is to fingerprint the storage system: by running a well-defined work- 
load on top of the storage system and measuring the amount of time required for 
different requests, one is able to infer a surprising amount of detail about the 
underlying system. 

The Skippy algorithm, from work by Nisha Talagala and colleagues at U.C.
Berkeley, uncovers the parameters of a single disk. The key is to factor out disk
rotational effects by making consecutive seeks to individual sectors with
addresses that differ by a linearly increasing amount (increasing by 1, 2, 3, and so
forth). Thus, the basic algorithm skips through the disk, increasing the distance of
the seek by one sector before every write, and outputs the distance and time for
each write. The raw device interface is used to avoid file system optimizations.
SECTOR_SIZE is set equal to the minimum amount of data that can be read at
once from the disk (e.g., 512 bytes). (Skippy is described in more detail in
Talagala et al. [1999].)


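fd = open("raw disk device", O_RDWR);   /* placeholder name for the raw device */

for (i = 0; i < measurements; i++) {
    begin_time = gettime();
    /* skip ahead i sectors from the current position, then write one sector */
    lseek(fd, i * SECTOR_SIZE, SEEK_CUR);
    write(fd, buffer, SECTOR_SIZE);
    interval_time = gettime() - begin_time;

    printf("Stride: %d Time: %d\n", i, interval_time);
}

close(fd);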
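
The listing calls gettime(), which is not a standard library function. One
possible stand-in, sketched here under the assumption of a POSIX system, reads
the monotonic clock and returns microseconds. The name gettime_us is
hypothetical. clock_gettime and CLOCK_MONOTONIC are the POSIX interfaces.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the listing's gettime(): elapsed time in
   microseconds from a clock that never runs backward. */
long long gettime_us(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;   /* silences the unused-variable warning when asserts are off */
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}
```

A monotonic clock is used rather than wall-clock time so that an adjustment to
the system clock during a run cannot produce a negative interval. With this
helper the time variables would be declared long long and printed with %lld.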


By graphing the time required for each write as a function of the seek dis- 
tance, one can infer the minimal transfer time (with no seek or rotational latency), 
head switch time, cylinder switch time, rotational latency, and the number of 
heads in the disk. A typical graph will have four distinct lines, each with the same 
slope, but with different offsets. The highest and lowest lines correspond to 
requests that incur different amounts of rotational delay, but no cylinder or head 
switch costs; the difference between these two lines reveals the rotational latency 
of the disk. The second lowest line corresponds to requests that incur a head 
switch (in addition to increasing amounts of rotational delay). Finally, the third 
line corresponds to requests that incur a cylinder switch (in addition to rotational 
delay). 


[10/10/10/10/10] <6.2> The results of running Skippy are shown for a mock disk 
(Disk Alpha) in Figure 6.25. 


a. [10] <6.2> What is the minimal transfer time?
b. [10] <6.2> What is the rotational latency?
c. [10] <6.2> What is the head switch time?
d. [10] <6.2> What is the cylinder switch time?
e. [10] <6.2> What is the number of disk heads?

Figure 6.25 Results from running Skippy on Disk Alpha.


[25] <6.2> Draw an approximation of the graph that would result from running 
Skippy on Disk Beta, a disk with the following parameters: 


Minimal transfer time: 2.0 ms 
Rotational latency: 6.0 ms 
Head switch time: 1.0 ms 
Cylinder switch time: 1.5 ms 
Number of disk heads: 4 
Sectors per track: 100 


[10/10/10/10/10/10/10] <6.2> Implement and run the Skippy algorithm on a disk 
drive of your choosing. 


a. [10] <6.2> Graph the results of running Skippy. Report the manufacturer and
model of your disk.
b. [10] <6.2> What is the minimal transfer time?
c. [10] <6.2> What is the rotational latency?
d. [10] <6.2> What is the head switch time?
e. [10] <6.2> What is the cylinder switch time?
f. [10] <6.2> What is the number of disk heads?
g. [10] <6.2> Do the results of running Skippy on a real disk differ in any
qualitative way from that of the mock disk?


Case Study 2: Deconstructing a Disk Array 


Concepts illustrated by this case study 


• Performance Characteristics
• Microbenchmarks


The Shear algorithm, from work by Timothy Denehy and colleagues at the Uni- 
versity of Wisconsin [Denehy et al. 2004], uncovers the parameters of a RAID 
system. The basic idea is to generate a workload of requests to the RAID array 
and time those requests; by observing which sets of requests take longer, one can 
infer which blocks are allocated to the same disk. 

We define RAID properties as follows. Data is allocated to disks in the RAID 
at the block level, where a block is the minimal unit of data that the file system 
reads or writes from the storage system; thus, block size is known by the file sys- 
tem and the fingerprinting software. A chunk is a set of blocks that is allocated 
contiguously within a disk. A stripe is a set of chunks across each of D data disks. 
Finally, a pattern is the minimum sequence of data blocks such that block offset i
within the pattern is always located on disk j.
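
To make these definitions concrete, the sketch below maps a logical block to a
disk for the simplest case, a RAID 0 array with no parity. The function names
are illustrative, and the round-robin chunk placement is an assumption about a
typical RAID 0 layout rather than something stated in the text.

```c
#include <assert.h>

/* Disk index (0..num_disks-1) holding a logical block in a RAID 0 array
   that places chunk_blocks consecutive blocks on each disk in turn. */
long raid0_disk_of_block(long block, long chunk_blocks, long num_disks) {
    assert(block >= 0 && chunk_blocks > 0 && num_disks > 0);
    return (block / chunk_blocks) % num_disks;
}

/* For this layout the placement repeats after one full stripe, so the
   pattern size in blocks is chunk_blocks * num_disks. */
long raid0_pattern_blocks(long chunk_blocks, long num_disks) {
    assert(chunk_blocks > 0 && num_disks > 0);
    return chunk_blocks * num_disks;
}
```

With four disks and four-block chunks, blocks 0 through 3 land on disk 0, block 4
starts disk 1, and the mapping repeats every 16 blocks. Layouts with rotating
parity, such as RAID 5, have patterns longer than one stripe.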


[20/20] <6.2> One can uncover the pattern size with the following code. The 
code accesses the raw device to avoid file system optimizations. The key to all of 
the Shear algorithms is to use random requests to avoid triggering any of the 
prefetch or caching mechanisms within the RAID or within individual disks. The 
basic idea of this code sequence is to access N random blocks at a fixed interval p 
within the RAID array and to measure the completion time of each interval. 


for (p = BLOCKSIZE; p <= testsize; p += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        request[i] = random() * p;    /* random multiple of the interval p */
    }
    begin_time = gettime();
    issue all request[N] to raw device in parallel;
    wait for all request[N] to complete;
    interval_time = gettime() - begin_time;
    printf("PatternSize: %d Time: %d\n", p, interval_time);
}
If you run this code on a RAID array and plot the measured time for the N
requests as a function of p, then you will see that the time is highest when all N
requests fall on the same disk; thus, the value of p with the highest time
corresponds to the pattern size of the RAID.

Figure 6.26 Results from running the pattern size algorithm of Shear on a mock
storage system.


a. [20] <6.2> Figure 6.26 shows the results of running the pattern size algorithm 
on an unknown RAID system. 


• What is the pattern size of this storage system?
• What do the measured times of 0.4, 0.8, and 1.6 seconds correspond to in
  this storage system?
• If this is a RAID 0 array, then how many disks are present?
• If this is a RAID 0 array, then what is the chunk size?

b. [20] <6.2> Draw the graph that would result from running this Shear code on
a storage system with the following characteristics:

• Number of requests: N = 1000
• Time for a random read on disk: 5 ms
• RAID level: RAID 0
• Number of disks: 4
• Chunk size: 8 KB


6.5 [20/20] <6.2> One can uncover the chunk size with the following code. The basic
idea is to perform reads from N patterns chosen at random, but always at
controlled offsets, c and c - 1, within the pattern.


for (c = 0; c < patternsize; c += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        requestA[i] = random() * patternsize + c;
        /* offset c - 1, wrapped so that c = 0 maps to the end of the pattern */
        requestB[i] = random() * patternsize +
                      (c - 1 + patternsize) % patternsize;
    }
    begin_time = gettime();
    issue all requestA[N] and requestB[N] to raw device in parallel;
    wait for requestA[N] and requestB[N] to complete;
    interval_time = gettime() - begin_time;
    printf("ChunkSize: %d Time: %d\n", c, interval_time);
}

Figure 6.27 Results from running the chunk size algorithm of Shear on a mock
storage system.


If you run this code and plot the measured time as a function of c, then you will 
see that the measured time is lowest when the requestA and requestB reads fall on 
two different disks. Thus, the values of c with low times correspond to the chunk 
boundaries between disks of the RAID. 


a. [20] <6.2> Figure 6.27 shows the results of running the chunk size algorithm
on an unknown RAID system.

• What is the chunk size of this storage system?
• What do the measured times of 0.75 and 15 seconds correspond to in this
  storage system?

b. [20] <6.2> Draw the graph that would result from running this Shear code on
a storage system with the following characteristics:

• Number of requests: N = 1000
• Time for a random read on disk: 5 ms
• RAID level: RAID 0
• Number of disks: 8
• Chunk size: 12 KB


[10/10/10/10] <6.2> Finally, one can determine the layout of chunks to disks 
with the following code. The basic idea is to select N random patterns, and to 
exhaustively read together all pairwise combinations of the chunks within the 
pattern. 
for (a = 0; a < numchunks; a += chunksize) {
    for (b = a; b < numchunks; b += chunksize) {
        for (i = 0; i < N; i++) {
            requestA[i] = random() * patternsize + a;
            requestB[i] = random() * patternsize + b;
        }
        begin_time = gettime();
        issue all requestA[N] and requestB[N] to raw device in parallel;
        wait for all requestA[N] and requestB[N] to complete;
        interval_time = gettime() - begin_time;
        printf("A: %d B: %d Time: %d\n", a, b, interval_time);
    }
}


After running this code, you can report the measured time as a function of a and 
b. The simplest way to graph this is to create a two-dimensional table with a and 
b as the parameters, and the time scaled to a shaded value; we use darker shad- 
ings for faster times and lighter shadings for slower times. Thus, a light shading 
indicates that the two offsets of a and b within the pattern fall on the same disk. 


Figure 6.28 shows the results of running the layout algorithm on a storage system 
that is known to have a pattern size of 384 KB and a chunk size of 32 KB. 


a. [20] <6.2> How many chunks are in a pattern? 


b. [20] <6.2> Which chunks of each pattern appear to be allocated on the same 
disks? 


Figure 6.28 Results from running the layout algorithm of Shear on a mock storage
system. Both axes are labeled Chunk.
Parity: RAID 5 Left-Asymmetric, stripe = 16, pattern = 48

Figure 6.29 A storage system with 4 disks, a chunk size of four 4 KB blocks, and
using a RAID 5 Left-Asymmetric layout. Two repetitions of the pattern are shown.


c. [20] <6.2> How many disks appear to be in this storage system? 
d. [20] <6.2> Draw the likely layout of blocks across the disks. 


[20] <6.2> Draw the graph that would result from running the layout algorithm 
on the storage system shown in Figure 6.29. This storage system has 4 disks, a 
chunk size of four 4 KB blocks (16 KB), and is using a RAID 5 Left-Asymmetric 
layout. 


Case Study 3: RAID Reconstruction 


Concepts illustrated by this case study 


• RAID Systems
• RAID Reconstruction
• Mean Time to Failure (MTTF)
• Mean Time until Data Loss (MTDL)
• Performability
• Double Failures


A RAID system ensures that data is not lost when a disk fails. Thus, one of the 
key responsibilities of a RAID is to reconstruct the data that was on a disk when 
it failed; this process is called reconstruction and is what you will explore in this 
case study. You will consider both a RAID system that can tolerate one disk fail- 
ure, and a RAID-DP, which can tolerate two disk failures. 

Reconstruction is commonly performed in two different ways. In off-line 
reconstruction, the RAID devotes all of its resources to performing reconstruc- 
tion and does not service any requests from the workload. In on-line reconstruc- 
tion, the RAID continues to service workload requests while performing the 
reconstruction; the reconstruction process is often limited to use some fraction of 
the total bandwidth of the RAID system. 

How reconstruction is performed impacts both the reliability and the per- 
formability of the system. In a RAID 5, data is lost if a second disk fails before 
the data from the first disk is recovered; therefore, the longer the reconstruction 
time (MTTR), the lower the reliability or the mean time until data loss (MTDL). 
Performability is a metric meant to combine both the performance of a system 
and its availability; it is defined as the performance of the system in a given state 
multiplied by the probability of that state. For a RAID array, possible states 
include normal operation with no disk failures, reconstruction with one disk fail- 
ure, and shutdown due to multiple disk failures. 
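
The definition above gives a per-state product. Summing those products over all
states to obtain a single figure is an assumption made in this sketch, as are the
struct and function names.

```c
#include <assert.h>
#include <stddef.h>

/* One operating state: its performance (e.g., IOPS) and its probability. */
struct raid_state {
    double performance;
    double probability;
};

/* Probability-weighted sum of per-state performance. */
double performability(const struct raid_state *states, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(states[i].probability >= 0.0 && states[i].probability <= 1.0);
        total += states[i].performance * states[i].probability;
    }
    return total;
}
```

For a RAID array one would supply three states: normal operation, reconstruction
after one failure, and shutdown after multiple failures, with probabilities that
sum to 1.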

For these exercises, assume that you have built a RAID system with six disks, 
plus a sufficient number of hot spares. Assume each disk is the 37 GB SCSI disk 
shown in Figure 6.3; assume each disk can sequentially read data at a peak of 142 
MB/sec and sequentially write data at a peak of 85 MB/sec. Assume that the 
disks are connected to an Ultra320 SCSI bus that can transfer a total of 320 MB/ 
sec. You can assume that each disk failure is independent and ignore other poten- 
tial failures in the system. For the reconstruction process, you can assume that the 
overhead for any XOR computation or memory copying is negligible. During 
online reconstruction, assume that the reconstruction process is limited to use a 
total bandwidth of 10 MB/sec from the RAID system. 
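
The earlier remark that a longer reconstruction time lowers MTDL can be made
concrete with a common first-order approximation. The sketch assumes independent
failures at a constant rate and a repair time much shorter than the MTTF. The
function names and the numbers in the note below are hypothetical and are not
taken from the exercises.

```c
#include <assert.h>

/* Approximate chance that one of the surviving disks fails during the repair
   window, assuming independent failures and MTTR much smaller than MTTF. */
double second_failure_probability(int disks, double mttf_hours, double mttr_hours) {
    assert(disks >= 2 && mttf_hours > 0.0 && mttr_hours >= 0.0);
    return (disks - 1) * mttr_hours / mttf_hours;
}

/* Mean time until data loss: expected time to a first failure divided by
   the chance that the first failure turns into a double failure. */
double mtdl_hours(int disks, double mttf_hours, double mttr_hours) {
    double p = second_failure_probability(disks, mttf_hours, mttr_hours);
    assert(p > 0.0);
    return (mttf_hours / disks) / p;
}
```

For example, ten disks with an MTTF of 100,000 hours and a 10-hour repair give a
second-failure probability of 0.0009 and an MTDL of about 11 million hours.
Halving the repair time doubles the MTDL under this model.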


6.8 [10] <6.2> Assume that you have a RAID 4 system with six disks. Draw a simple
diagram showing the layout of blocks across disks for this RAID system. 


6.9 [10] <6.2, 6.4> When a single disk fails, the RAID 4 system will perform
reconstruction. What is the expected time until a reconstruction is needed?


6.10 [10/10/10] <6.2, 6.4> Assume that reconstruction of the RAID 4 array begins at
time t. 


a. [10] <6.2, 6.4> What read and write operations are required to perform the 
reconstruction? 


b. [10] <6.2, 6.4> For offline reconstruction, when will the reconstruction pro- 
cess be complete? 


c. [10] <6.2, 6.4> For online reconstruction, when will the reconstruction pro- 
cess be complete? 


6.11 [10/10/10/10] <6.2, 6.4> In this exercise, we will investigate the mean time until
data loss (MTDL). In RAID 4, data is lost only if a second disk fails before the 
first failed disk is repaired. 


a. [10] <6.2, 6.4> What is the likelihood of having a second failure during off- 
line reconstruction? 

b. [10] <6.2, 6.4> Given this likelihood of a second failure during reconstruc- 
tion, what is the MTDL for offline reconstruction? 


c. [10] <6.2, 6.4> What is the likelihood of having a second failure during 
online reconstruction? 
d. [10] <6.2, 6.4> Given this likelihood of a second failure during reconstruc-
tion, what is the MTDL for online reconstruction? 


[10] <6.2, 6.4> What is performability for the RAID 4 array for offline recon- 
struction? Calculate the performability using IOPS, assuming a random read- 
only workload that is evenly distributed across the disks of the RAID 4 array. 


[10] <6.2, 6.4> What is the performability for the RAID 4 array for online
reconstruction? During online repair, you can assume that the IOPS drop to 70% of
their peak rate. Does offline or online reconstruction lead to better performability?


[10] <6.2, 6.4> RAID 6 is used to tolerate up to two simultaneous disk failures.
Assume that you have a RAID 6 system based on row-diagonal parity, or RAID-DP;
your six-disk RAID-DP system is based on RAID 4, with p = 5, as shown in
Figure 6.5. If data disk 0 and data disk 3 fail, how can those disks be
reconstructed? Show the sequence of steps that are required to compute the missing
blocks in the first four stripes.


Case Study 4: Performance Prediction for RAIDs 


Concepts illustrated by this case study 


• RAID Levels
• Queuing Theory
• Impact of Workloads
• Impact of Disk Layout


In this case study, you will explore how simple queuing theory can be used to 
predict the performance of the I/O system. You will investigate how both storage 
system configuration and the workload influence service time, disk utilization, 
and average response time. 

The configuration of the storage system has a large impact on performance. 
Different RAID levels can be modeled using queuing theory in different ways. 
For example, a RAID 0 array containing N disks can be modeled as N separate
systems of M/M/1 queues, assuming that requests are appropriately distributed
across the N disks. The behavior of a RAID 1 array depends upon the workload:
a read operation can be sent to either mirror, whereas a write operation
must be sent to both disks. Therefore, for a read-only workload, a two-disk
RAID 1 array can be modeled as an M/M/2 queue, whereas for a write-only
workload, it can be modeled as an M/M/1 queue. The behavior of a RAID 4
array containing N disks also depends upon the workload: a read will be sent to
a particular data disk, whereas writes must all update the parity disk, which
becomes the bottleneck of the system. Therefore, for a read-only workload,
RAID 4 can be modeled as N - 1 separate systems, whereas for a write-only
workload, it can be modeled as one M/M/1 queue.
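
The M/M/1 building block used in these models can be sketched with the standard
queuing formulas. The struct and function names are illustrative, and the arrival
rate and service time in the note below are hypothetical rather than the case
study's values.

```c
#include <assert.h>

/* Results of a single-server M/M/1 queue model. */
struct mm1_result {
    double utilization;     /* fraction of time the server is busy */
    double time_queue;      /* mean wait before service, in seconds */
    double time_system;     /* mean response time: wait plus service */
    double length_queue;    /* mean number of requests waiting */
};

/* arrival_rate in requests per second, service_time in seconds. */
struct mm1_result mm1(double arrival_rate, double service_time) {
    struct mm1_result r;
    r.utilization = arrival_rate * service_time;
    assert(r.utilization >= 0.0 && r.utilization < 1.0);   /* must be stable */
    r.time_queue = service_time * r.utilization / (1.0 - r.utilization);
    r.time_system = r.time_queue + service_time;
    r.length_queue = arrival_rate * r.time_queue;           /* Little's law */
    return r;
}
```

With 100 requests per second and a 4 ms service time, utilization is 0.4 and the
mean wait is about 2.7 ms. For a RAID 0 array modeled as N separate M/M/1 queues,
one would pass the per-disk arrival rate, that is, the total rate divided by N,
assuming requests spread evenly across the disks.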

The layout of blocks within the storage system can have a significant impact 
on performance. Consider a single disk with a 40 GB capacity. If the workload 
randomly accesses 40 GB of data, then the layout of those blocks to the disk does
not have much of an impact on performance. However, if the workload randomly 
accesses only half of the disk's capacity (i.e., 20 GB of data on that disk), then 
layout does matter: to reduce seek time, the 20 GB of data can be compacted 
within 20 GB of consecutive tracks instead of allocated uniformly distributed 
over the entire 40 GB capacity. 

For this problem, we will use a rather simplistic model to estimate the service 
time of a disk. In this basic model, the average positioning and transfer time for a 
small random request is a linear function of the seek distance. For the 40 GB disk 
in this problem, assume that the service time is 5 ms * space utilization. Thus, if 
the entire 40 GB disk is used, then the average positioning and transfer time for a 
random request is 5 ms; if only the first 20 GB of the disk is used, then the aver- 
age positioning and transfer time is 2.5 ms. 
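
The service-time model in this paragraph is a one-line function. The sketch below
uses an illustrative function name and hard-codes the 5 ms constant from the text.

```c
#include <assert.h>

/* Case-study model: average positioning plus transfer time for a small
   random request is 5 ms times the fraction of the disk in use. */
double service_time_ms(double used_gb, double capacity_gb) {
    assert(capacity_gb > 0.0 && used_gb >= 0.0 && used_gb <= capacity_gb);
    return 5.0 * (used_gb / capacity_gb);
}
```

Calling it with 40 GB used on a 40 GB disk returns 5 ms, and with 20 GB used it
returns 2.5 ms, matching the two cases described above.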

Throughout this case study, you can assume that the processor sends 167 
small random disk requests per second and that these requests are exponentially 
distributed. You can assume that the size of the requests is equal to the block size 
of 8 KB. Each disk in the system has a capacity of 40 GB. Regardless of the stor- 
age system configuration, the workload accesses a total of 40 GB of data; you 
should allocate the 40 GB of data across the disks in the system in the most effi- 
cient manner. 


[10/10/10/10/10] <6.5> Begin by assuming that the storage system consists of a 
single 40 GB disk. 


a. [10] <6.5> Given this workload and storage system, what is the average ser- 
vice time? 
b. [10] <6.5> On average, what is the utilization of the disk? 


c. [10] <6.5> On average, how much time does each request spend waiting for 
the disk? 


d. [10] <6.5> What is the mean number of requests in the queue? 
e. [10] <6.5> Finally, what is the average response time for the disk requests? 


[10/10/10/10/10/10] <6.2, 6.5> Imagine that the storage system is now config- 
ured to contain two 40 GB disks in a RAID 0 array; that is, the data is striped in 
blocks of 8 KB equally across the two disks with no redundancy. 


a. [10] <6.2, 6.5> How will the 40 GB of data be allocated across the disks? 
Given a random request workload over a total of 40 GB, what is the expected 
service time of each request? 


b. [10] <6.2, 6.5> How can queuing theory be used to model this storage system? 
c. [10] <6.2, 6.5> What is the average utilization of each disk? 


d. [10] <6.2, 6.5> On average, how much time does each request spend waiting 
for the disk? 


e. [10] <6.2, 6.5> What is the mean number of requests in each queue? 


f. [10] <6.2, 6.5> Finally, what is the average response time for the disk
requests?
[20/20/20/20/20] <6.2, 6.5> Instead imagine that the storage system is configured
to contain two 40 GB disks in a RAID 1 array; that is, the data is mirrored across 
the two disks. Use queuing theory to model this system for a read-only workload. 


a. [20] <6.2, 6.5> How will the 40 GB of data be allocated across the disks? 
Given a random request workload over a total of 40 GB, what is the expected 
service time of each request? 


b. [20] <6.2, 6.5> How can queuing theory be used to model this storage sys- 
tem? 

c. [20] <6.2, 6.5> What is the average utilization of each disk? 

d. [20] <6.2, 6.5> On average, how much time does each request spend waiting 
for the disk? 


e. [20] <6.2, 6.5> Finally, what is the average response time for the disk 
requests? 


[10/10] <6.2, 6.5> Imagine that instead of a read-only workload, you now have a 
write-only workload on a RAID 1 array. 


a. [10] <6.2, 6.5> Describe how you can use queuing theory to model this sys- 
tem and workload. 


b. [10] <6.2, 6.5> Given this system and workload, what is the average utiliza- 
tion, average waiting time, and average response time? 


Case Study 5: I/O Subsystem Design 


Concepts illustrated by this case study 


■ RAID Systems 

■ Mean Time to Failure (MTTF) 

■ Performance and Reliability Trade-offs 


In this case study, you will design an I/O subsystem, given a monetary budget. 
Your system will have a minimum required capacity and you will optimize for 
performance, reliability, or both. You are free to use as many disks and controllers 
as fit within your budget. 

Here are your building blocks: 


■ A 10,000 MIPS CPU costing $1000. Its MTTF is 1,000,000 hours. 

■ A 1000 MB/sec I/O bus with room for 20 Ultra320 SCSI buses and 
controllers. 

■ Ultra320 SCSI buses that can transfer 320 MB/sec and support up to 15 disks 
per bus (these are also called SCSI strings). The SCSI cable MTTF is 
1,000,000 hours. 

■ An Ultra320 SCSI controller that is capable of 50,000 IOPS, costs $250, and 
has an MTTF of 500,000 hours. 

■ A $2000 enclosure supplying power and cooling to up to eight disks. The 
enclosure MTTF is 1,000,000 hours, the fan MTTF is 200,000 hours, and the 
power supply MTTF is 200,000 hours. 

■ The SCSI disks described in Figure 6.3. 

■ Replacing any failed component requires 24 hours. 

You may make the following assumptions about your workload: 


■ The operating system requires 70,000 CPU instructions for each disk I/O. 

■ The workload consists of many concurrent, random I/Os, with an average 
size of 16 KB. 


All of your constructed systems must have the following properties: 


■ You have a monetary budget of $28,000. 

■ You must provide at least 1 TB of capacity. 

[10] <6.2> You will begin by designing an I/O subsystem that is optimized only 
for capacity and performance (and not reliability), specifically IOPS. Discuss the 
RAID level and block size that will deliver the best performance. 

[20/20/20/20] <6.2, 6.4, 6.7> What configuration of SCSI disks, controllers, and 
enclosures results in the best performance given your monetary and capacity 
constraints? 

a. [20] <6.2, 6.4, 6.7> How many IOPS do you expect to deliver with your sys- 
tem? 

b. [20] <6.2, 6.4, 6.7> How much does your system cost? 

c. [20] <6.2, 6.4, 6.7> What is the capacity of your system? 

d. [20] <6.2, 6.4, 6.7> What is the MTTF of your system? 
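
One common way to estimate the MTTF of a system built from many parts is to assume that component lifetimes are exponentially distributed and that failures are independent, so that the component failure rates simply add. The sketch below follows that assumption. The function name `system_mttf` and its parameters are hypothetical, and the model treats any single component failure as a system failure, so it does not capture redundancy.

```c
#include <stddef.h>

/* MTTF (in hours) of a system that fails when any one component fails,
   assuming independent, exponentially distributed lifetimes.
   mttf_hours[i] is the MTTF of component type i and count[i] is the
   number of components of that type in the system. */
double system_mttf(const double *mttf_hours, const int *count, size_t n)
{
    double failure_rate = 0.0;                 /* failures per hour */
    for (size_t i = 0; i < n; i++)
        failure_rate += count[i] / mttf_hours[i];
    return 1.0 / failure_rate;
}
```

For a made-up system with one component rated at 1,000,000 hours and two rated at 200,000 hours, the summed failure rate is 11 per million hours, which gives a system MTTF of roughly 90,909 hours.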

[10] <6.2, 6.4, 6.7> You will now redesign your system to optimize for reliability, 
by creating a RAID 10 or RAID 01 array. Your storage system should be robust 
not only to disk failures, but to controller, cable, power supply, and fan failures as 
well; specifically, a single component failure should not prohibit accessing both 
replicas of a pair. Draw a diagram illustrating how blocks are allocated across 
disks in the RAID 10 and RAID 01 configurations. Is RAID 10 or RAID 01 more 
appropriate in this environment? 

[20/20/20/20/20] <6.2, 6.4, 6.7> Optimizing your RAID 10 or RAID 01 array 
only for reliability (but keeping within your capacity and monetary constraints), 
what is your RAID configuration? 

a. [20] <6.2, 6.4, 6.7> What is the overall MTTF of the components in your 
system? 

b. [20] <6.2, 6.4, 6.7> What is the MTDL of your system? 

c. [20] <6.2, 6.4, 6.7> What is the usable capacity of this system? 

d. [20] <6.2, 6.4, 6.7> How much does your system cost? 

e. [20] <6.2, 6.4, 6.7> Assuming a write-only workload, how many IOPS can 
you expect to deliver? 


[10] <6.2, 6.4, 6.7> Assume that you now have access to a disk that has twice the 
capacity, for the same price. If you continue to design only for reliability, how 
would you change the configuration of your storage system? Why? 


Case Study 6: Dirty Rotten Bits 


Concepts illustrated by this case study 


■ Partial Disk Failure 

■ Failure Analysis 

■ Performance Analysis 

■ Parity Protection 

■ Checksumming 


You are put in charge of avoiding the problem of "bit rot": bits or blocks in a file 
going bad over time. This problem is particularly important in archival scenarios, 
where data is written once and perhaps accessed many years later. Without taking 
extra measures to protect the data, the bits or blocks of a file may slowly change 
or become unavailable due to media errors or other I/O faults. 

Dealing with bit rot requires two specific components: detection and recov- 
ery. To detect bit rot efficiently, one can use checksums over each block of the file 
in question; a checksum is just a function of some kind that takes a (potentially 
long) string of data as input and outputs a fixed-size string (the checksum) of the 
data as output. The property you will exploit is that if the data changes, the com- 
puted checksum is very likely to change as well. 

Once detected, recovering from bit rot requires some form of redundancy. 
Examples include mirroring (keeping multiple copies of each block) and parity 
(some extra redundant information, usually more space efficient than mirroring). 

In this case study, you will analyze how effective these techniques are given 
various scenarios. You will also write code to implement data integrity protection 
over a set of files. 


[20/20/20] <6.2> Assume that you will use simple parity protection in Exercises 
6.24 through 6.27. Specifically, assume that you will be computing one parity 
block for each file in the file system. Further, assume that you will also use a 20- 
byte MD5 checksum per 4 KB block of each file. 
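
The per-file bookkeeping implied by this scheme can be sketched as follows. The block and checksum sizes are the ones stated in the exercise; the function names are hypothetical, and the sketch assumes that the single parity block per file occupies one full 4 KB block.

```c
#include <stdint.h>

#define BLOCK_BYTES    4096u  /* 4 KB block, as stated in the exercise */
#define CHECKSUM_BYTES 20u    /* per-block checksum size, as stated in the exercise */

/* Number of 4 KB blocks needed to hold a file, rounded up. */
uint64_t blocks_in_file(uint64_t file_bytes)
{
    return (file_bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
}

/* Bytes of checksums needed to detect a bad block in one file. */
uint64_t checksum_overhead(uint64_t file_bytes)
{
    return blocks_in_file(file_bytes) * CHECKSUM_BYTES;
}

/* Bytes needed to both detect and correct a single bad block:
   the checksums plus one parity block (assumed to be a full block). */
uint64_t protection_overhead(uint64_t file_bytes)
{
    return checksum_overhead(file_bytes) + BLOCK_BYTES;
}
```

For example, a 16 KB file spans four blocks, so it needs 80 bytes of checksums, or 4176 bytes once the parity block is included.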
We first tackle the problem of space overhead. According to recent studies 
[Douceur and Bolosky 1999], these are the file size distributions found in modern 
PCs: 


Size       | <1 KB | 2 KB  | 4 KB  | 8 KB  | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB | >1 MB 
Percentage | 26.6% | 11.0% | 10.9% | 11.2% | 9.5%  | 8.5%  | 7.1%  | 5.1%   | 3.7%   | 2.4%   | 4.0% 


The study also finds that file systems are usually about half full. Assume you 
have a 37 GB disk volume that is roughly half full and follows that same distribu- 
tion, and answer the following questions: 


a. [20] <6.2> How much extra information (both in bytes and as a percent of the 
volume) must you keep on disk to be able to detect a single error with check- 
sums? 


b. [20] <6.2> How much extra information (both in bytes and as a percent of the 
volume) would you need to be able to both detect a single error with check- 
sums as well as correct it? 


c. [20] <6.2> Given this file distribution, is the block size you are using to com- 
pute checksums too big, too little, or just right? 


[10/10] <6.2, 6.3> One big problem that arises in data protection is error 
detection. One approach is to perform error detection lazily; that is, wait until a 
file is accessed, and at that point, check it and make sure the correct data is there. 
The problem with this approach is that files that are not accessed frequently may 
slowly rot away and, when finally accessed, have too many errors to be corrected. 
Hence, an eager approach is to perform what is sometimes called disk scrubbing: 
periodically go through all data and find errors proactively. 


a. [10] <6.2, 6.3> Assume that bit flips occur independently, at a rate of 1 flip 
per GB of data per month. Assuming the same 20 GB volume that is half full, 
and assuming that you are using the SCSI disk as specified in Figure 6.3 
(4 ms seek, roughly 100 MB/sec transfer), how often should you scan 
through files to check and repair their integrity? 


b. [10] <6.2, 6.3> At what bit flip rate does it become impossible to maintain 
data integrity? Again assume the 20 GB volume and the SCSI disk. 


[10/10/10/10] <6.2, 6.4> Another potential cost of added data protection is found 
in performance overhead. We now study the performance overhead of this data 
protection approach. 


a. [10] <6.2, 6.4> Assume we write a 40 MB file to the SCSI disk sequentially, 
and then write out the extra information to implement our data protection 
scheme to disk once. How much write traffic (both in total volume of bytes 
and as a percentage of total traffic) does our scheme generate? 


b. [10] <6.2, 6.4> Assume we now are updating the file randomly, similar to a 
database table. That is, assume we perform a series of 4 KB random writes to 
the file, and each time we perform a single write, we must update the on-disk 
protection information. Assuming that we perform 10,000 random writes, 
how much I/O traffic (both in total volume of bytes and as a percentage of 
total traffic) does our scheme generate? 


c. [10] <6.2, 6.4> Now assume that the data protection information is always 
kept in a separate portion of the disk, away from the file it is guarding (that is, 
assume for each file A, there is another file A_checksums that holds all the 
checksums for A). Hence, one potential overhead we must incur arises upon 
reads; that is, upon each read, we will use the checksum to detect data 
corruption. 

Assume you read 10,000 blocks of 4 KB each sequentially from disk. Assuming 
a 4 ms average seek cost and a 100 MB/sec transfer rate (like the SCSI 
disk in Figure 6.3), how long will it take to read the file (and corresponding 
checksums) from disk? What is the time penalty due to adding checksums? 


d. [10] <6.2, 6.4> Again assuming that the data protection information is kept 
separate as in part (c), now assume you have to read 10,000 random blocks of 
4 KB each from a very large file (much bigger than 10,000 blocks, that is). 
For each read, you must again use the checksum to ensure data integrity. How 
long will it take to read the 10,000 blocks from disk, again assuming the same 
disk characteristics? What is the time penalty due to adding checksums? 


[40] <6.2, 6.3, 6.4> Finally, we put theory into practice by developing a 
user-level tool to guard against file corruption. Assume you are to write a simple 
set of tools to detect and repair data integrity problems. The first tool is used to 
build checksums and parity. It should be called build and used like this: 


build <filename> 


The build program should then store the needed checksum and redundancy 
information for the file filename in a file in the same directory called 
.filename.cp (so it is easy to find later). 


A second program is then used to check and potentially repair damaged files. It 
should be called repair and used like this: 


repair <filename> 


The repair program should consult the .cp file for the filename in question and 
verify that all the stored checksums match the computed checksums for the data. 
If the checksums don't match for a single block, repair should use the redundant 
information to reconstruct the correct data and fix the file. However, if two 
or more blocks are bad, repair should simply report that the file has been 
corrupted beyond repair. To test your system, we will provide a tool to corrupt 
files called corrupt. It works as follows: 


corrupt <filename> <blocknumber> 


All corrupt does is fill the specified block number of the file with random noise. 
For checksums you will be using MD5. MD5 takes an input string and gives you 
a 128-bit "fingerprint" or checksum as an output. A great and simple 
implementation of MD5 is available here: 


http://sourceforge.net/project/showfiles.php?group_id=42360 


Parity is computed with the XOR operator. In C code, you can compute the parity 
of two blocks, each of size BLOCKSIZE, as follows: 


unsigned char block1[BLOCKSIZE]; 
unsigned char block2[BLOCKSIZE]; 
unsigned char parity[BLOCKSIZE]; 

// first, clear parity block 
for (int i = 0; i < BLOCKSIZE; i++) 
    parity[i] = 0; 

// then compute parity; caret symbol does XOR in C 
for (int i = 0; i < BLOCKSIZE; i++) { 
    parity[i] = block1[i] ^ block2[i]; 
} 
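
The same XOR property is what makes repair possible: XORing the parity block with every surviving block yields the missing one. The sketch below illustrates this under the assumption that exactly one block is known to be bad (for example, because its checksum failed). The function name `rebuild_block` and its parameters are hypothetical and are not part of the assignment.

```c
#include <stddef.h>

/* Rebuild one lost block from the parity block and the surviving blocks.
   blocks:     array of nblocks pointers, each to block_size bytes
   bad:        index of the block to reconstruct (its contents are overwritten)
   parity:     XOR of all nblocks original blocks
   block_size: number of bytes in each block */
void rebuild_block(unsigned char **blocks, size_t nblocks, size_t bad,
                   const unsigned char *parity, size_t block_size)
{
    for (size_t i = 0; i < block_size; i++) {
        unsigned char x = parity[i];
        for (size_t b = 0; b < nblocks; b++) {
            if (b != bad)
                x ^= blocks[b][i];   /* cancel out each good block */
        }
        blocks[bad][i] = x;          /* what remains is the lost byte */
    }
}
```

With two or more bad blocks the XOR no longer isolates a single unknown, which is why repair can only report such a file as corrupted beyond repair.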


Case Study 7: Sorting Things Out 


Concepts illustrated by this case study 


■ Benchmarking 

■ Performance Analysis 

■ Cost/Performance Analysis 

■ Amortization of Overhead 

■ Balanced Systems 

The database field has a long history of using benchmarks to compare systems. In 
this question, you will explore one of the benchmarks introduced by Anonymous 
et al. [1985] (see Chapter 1): external, or disk-to-disk, sorting. 

Sorting is an exciting benchmark for a number of reasons. First, sorting exer- 
cises a computer system across all its components, including disk, memory, and 
processors. Second, sorting at the highest possible performance requires a great 
deal of expertise about how the CPU caches, operating systems, and I/O sub- 
systems work. Third, it is simple enough to be implemented by a student (see 
below!). 

Depending on how much data you have, sorting can be done in one or multi- 
ple passes. Simply put, if you have enough memory to hold the entire data set in 
memory, you can read the entire data set into memory, sort it, and then write it 
out; this is called a "one-pass" sort. 

If you do not have enough memory, you must sort the data in multiple passes. 
There are many different approaches possible. One simple approach is to sort 
each chunk of the input file and write it to disk; this leaves 
(input file size)/(memory size) sorted files on disk. Then, you have to merge each 
sorted temporary file into a final sorted output. This is called a "two-pass" sort. 
More passes are needed in the unlikely case that you cannot merge all the streams 
in the second pass. 

In this case study you will analyze various aspects of sorting, determining its 
effectiveness and cost-effectiveness in different scenarios. You will also write 
your own version of an external sort, measuring its performance on real hard- 
ware. 


[20/20/20] <6.4> We will start by configuring a system to complete a sort in the 
least possible time, with no limits on how much we can spend. To get peak band- 
width from the sort, we have to make sure all the paths through the system have 
sufficient bandwidth. 


Assume for simplicity that the time to perform the in-memory sort of keys is 
linearly proportional to the CPU rate and memory bandwidth of the given machine 
(e.g., sorting 1 MB of records on a machine with 1 MB/sec of memory bandwidth 
and a 1 MIPS processor will take 1 second). Assume further that you have 
carefully written the I/O phases of the sort so as to achieve sequential bandwidth. 
And of course realize that if you don't have enough memory to hold all of the data 
at once, the sort will take two passes. 


One problem you may encounter in performing I/O is that systems often perform 
extra memory copies; for example, when the read() system call is invoked, data 
may first be read from disk into a system buffer, and then subsequently copied 
into the specified user buffer. Hence, memory bandwidth during I/O can be an 
issue. 


Finally, for simplicity, assume that there is no overlap of reading, sorting, or writ- 
ing. That is, when you are reading data from disk, that is all you are doing; when 
sorting, you are just using the CPU and memory bandwidth; when writing, you 
are just writing data to disk. 


Your job in this task is to configure a system to extract peak performance when 
sorting 1 GB of data (i.e., roughly 10 million 100-byte records). Use the follow- 
ing table to make choices about which machine, memory, I/O interconnect, and 
disks to buy. 





CPU                              I/O interconnect 
Slow      1 GIPS      $200       Slow      80 MB/sec   $50 
Standard  2 GIPS      $1000      Standard  160 MB/sec  $100 
Fast      4 GIPS      $2000      Fast      320 MB/sec  $400 

Memory                           Disks 
Slow      512 MB/sec  $100/GB    Slow      30 MB/sec   $70 
Standard  1 GB/sec    $200/GB    Standard  60 MB/sec   $120 
Fast      2 GB/sec    $500/GB    Fast      110 MB/sec  $300 

Note: Assume you are buying a single-processor system, and that you can have 
up to two I/O interconnects. However, the amount of memory and number of 
disks is up to you (assume there is no limit on disks per I/O interconnect). 


a. [20] <6.4> What is the total cost of your machine? (Break this down by part, 
including the cost of the CPU, amount of memory, number of disks, and I/O 
bus.) 


b. [20] <6.4> How much time does it take to complete the sort of 1 GB worth of 
records? (Break this down into time spent doing reads from disk, writes to 
disk, and time spent sorting.) 


c. [20] <6.4> What is the bottleneck in your system? 


[25/25/25] <6.4> We will now examine cost-performance issues in sorting. After 
all, it is easy to buy a high-performing machine; it is much harder to buy a cost- 
effective one. 


One place where this issue arises is with the PennySort competition 
(research.microsoft.com/barc/SortBenchmark/). PennySort asks that you sort as 
many records as you can for a single penny. To compute this, you should assume 
that a system you buy will last for 3 years (94,608,000 seconds), and divide this 
by the total cost in pennies of the machine. The result is your time budget per 
penny. 
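
The time-budget arithmetic described above can be sketched in a few lines. The function name `seconds_per_penny` is hypothetical; the 3-year lifetime is the one stated in the text.

```c
/* Seconds of machine time that one penny buys, assuming the machine
   lasts 3 years (94,608,000 seconds), as stated in the text. */
double seconds_per_penny(double machine_cost_dollars)
{
    const double lifetime_seconds = 94608000.0;
    double cost_pennies = machine_cost_dollars * 100.0;
    return lifetime_seconds / cost_pennies;
}
```

For instance, a machine costing $1000 (100,000 pennies) would have a budget of about 946 seconds per penny.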

Our task here will be a little simpler. Assume you have a fixed budget of $2000 
(or less). What is the fastest sorting machine you can build? Use the same hard- 
ware table as in Exercise 6.28 to configure the winning machine. 





(Hint: You might want to write a little computer program to generate all the 
possible configurations.) 


a. [25] <6.4> What is the total cost of your machine? (Break this down by part, 
including the cost of the CPU, amount of memory, number of disks, and I/O 
bus.) 


b. [25] <6.4> How does the reading, writing, and sorting time break down with 
this configuration? 


c. [25] <6.4> What is the bottleneck in your system? 


[20/20/20] <6.4, 6.6> Getting good disk performance often requires amortization 
of overhead. The idea is simple: if you must incur an overhead of some kind, do 
as much useful work as possible after paying the cost, and hence reduce its 
impact. This idea is quite general and can be applied to many areas of computer 
systems; with disks, it arises with the seek and rotational costs (overheads) that 
you must incur before transferring data. You can amortize an expensive seek and 
rotation by transferring a large amount of data. 
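
The effect of amortization can be made concrete with a small model of one disk access. This is a sketch under simplifying assumptions: each access pays one average seek plus an average rotational latency of half a revolution, followed by a transfer at the full sequential rate. The function names and the drive parameters used in the note after the code are illustrative only.

```c
/* Time for one disk access in milliseconds: average seek, average
   rotational latency (half a revolution), and transfer time. */
double access_time_ms(double seek_ms, double rpm,
                      double transfer_mb_per_sec, double request_mb)
{
    double rotation_ms = 0.5 * 60000.0 / rpm;
    double transfer_ms = 1000.0 * request_mb / transfer_mb_per_sec;
    return seek_ms + rotation_ms + transfer_ms;
}

/* Fraction of the access time that is spent transferring data. */
double disk_efficiency(double seek_ms, double rpm,
                       double transfer_mb_per_sec, double request_mb)
{
    double transfer_ms = 1000.0 * request_mb / transfer_mb_per_sec;
    return transfer_ms /
           access_time_ms(seek_ms, rpm, transfer_mb_per_sec, request_mb);
}
```

For a made-up drive with a 5 ms seek, 6000 RPM, and 100 MB/sec transfer rate, a 1 MB request spends 10 ms of its 20 ms access transferring data (efficiency 0.5), while a 10 MB request spends 100 ms of 110 ms (efficiency about 0.91).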


In this exercise, we focus on how to amortize seek and rotational costs during the 
second pass of a two-pass sort. Assume that when the second pass begins, there 
are N sorted runs on the disk, each of a size that fits within main memory. Our 
task here is to read in a chunk from each sorted run and merge the results into a 
final sorted output. Note that a read from one run will incur a seek and rotation, 
as it is very likely that the last read was from a different run. 


a. [20] <6.4, 6.6> Assume that you have a disk that can transfer at 100 MB/sec, 
with an average seek cost of 7 ms, and a rotational rate of 10,000 RPM. 
Assume further that every time you read from a run, you read 1 MB of data, 
and that there are 100 runs each of size 1 GB. Also assume that writes (to the 
final sorted output) take place in large 1 GB chunks. How long will the merge 
phase take, assuming I/O is the dominant (i.e., only) cost? 


b. [20] <6.4, 6.6> Now assume that you change the read size from 1 MB to 10 
MB. How is the total time to perform the second pass of the sort affected? 


c. [20] <6.4, 6.6> In both cases, assume that what we wish to maximize is disk 
efficiency. We compute disk efficiency as the ratio of the time spent transfer- 
ring data over the total time spent accessing the disk. What is the disk effi- 
ciency in each of the scenarios mentioned above? 


[40] <6.2, 6.4, 6.6> In this exercise, you will write your own external sort. To 
generate the data set, we provide a tool generate that works as follows: 


generate <filename> <size (in MB)> 


By running generate, you create a file named filename of size size MB. The 
file consists of 100-byte records, with 10-byte keys (the part that must be sorted). 


We also provide a tool called check that checks whether a given input file is 
sorted or not. It is run as follows: 


check <filename> 


The basic one-pass sort does the following: reads in the data, sorts it, and then 
writes it out. However, numerous optimizations are available to you: overlapping 
reading and sorting, separating keys from the rest of the record for better cache 
behavior and hence faster sorting, overlapping sorting and writing, and so forth. 
Nyberg et al. [1994] is a terrific place to look for some hints. 


One important rule is that data must always start on disk (and not in the file 
system cache). The easiest way to ensure this is to unmount and remount the file 
system. 


One goal: beat the Datamation sort record. Currently, the record for sorting 1 mil- 
lion 100-byte records is 0.44 seconds, which was obtained on a cluster of 32 
machines. If you are careful, you might be able to beat this on a single PC config- 
ured with a few disks. 


A.1 Introduction 


Many readers of this text will have covered the basics of pipelining in another 
text (such as our more basic text Computer Organization and Design) or in 
another course. Because Chapters 2 and 3 build heavily on this material, readers 
should ensure that they are familiar with the concepts discussed in this appendix 
before proceeding. As you read Chapter 2, you may find it helpful to turn to this 
material for a quick review. 

We begin the appendix with the basics of pipelining, including discussing the 
data path implications, introducing hazards, and examining the performance of 
pipelines. This section describes the basic five-stage RISC pipeline that is the 
basis for the rest of the appendix. Section A.2 describes the issue of hazards, why 
they cause performance problems and how they can be dealt with. Section A.3 
discusses how the simple five-stage pipeline is actually implemented, focusing on 
control and how hazards are dealt with. 

Section A.4 discusses the interaction between pipelining and various aspects 
of instruction set design, including discussing the important topic of exceptions 
and their interaction with pipelining. Readers unfamiliar with the concepts of 
precise and imprecise interrupts and resumption after exceptions will find this 
material useful, since they are key to understanding the more advanced 
approaches in Chapter 2. 

Section A.5 discusses how the five-stage pipeline can be extended to handle 
longer-running floating-point instructions. Section A.6 puts these concepts 
together in a case study of a deeply pipelined processor, the MIPS R4000/4400, 
including both the eight-stage integer pipeline and the floating-point pipeline. 

Section A.7 introduces the concept of dynamic scheduling and the use of 
scoreboards to implement dynamic scheduling. It is introduced as a crosscutting 
issue, since it can be used to serve as an introduction to the core concepts in 
Chapter 2, which focuses on dynamically scheduled approaches. Section A.7 is 
also a gentle introduction to the more complex Tomasulo's algorithm covered in 
Chapter 2. Although Tomasulo's algorithm can be covered and understood with- 
out introducing scoreboarding, the scoreboarding approach is simpler and easier 
to comprehend. 


What Is Pipelining? 


Pipelining is an implementation technique whereby multiple instructions are 
overlapped in execution; it takes advantage of parallelism that exists among the 
actions needed to execute an instruction. Today, pipelining is the key implemen- 
tation technique used to make fast CPUs. 

A pipeline is like an assembly line. In an automobile assembly line, there are 
many steps, each contributing something to the construction of the car. Each step 
operates in parallel with the other steps, although on a different car. In a computer 
pipeline, each step in the pipeline completes a part of an instruction. Like the 
assembly line, different steps are completing different parts of different 
instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The 
stages are connected one to the next to form a pipe—instructions enter at one 
end, progress through the stages, and exit at the other end, just as cars would in 
an assembly line. 

In an automobile assembly line, throughput is defined as the number of cars 
per hour and is determined by how often a completed car exits the assembly line. 
Likewise, the throughput of an instruction pipeline is determined by how often an 
instruction exits the pipeline. Because the pipe stages are hooked together, all the 
stages must be ready to proceed at the same time, just as we would require in an 
assembly line. The time required between moving an instruction one step down 
the pipeline is a processor cycle. Because all stages proceed at the same time, the 
length of a processor cycle is determined by the time required for the slowest 
pipe stage, just as in an auto assembly line, the longest step would determine the 
time between advancing the line. In a computer, this processor cycle is usually 
1 clock cycle (sometimes it is 2, rarely more). 

The pipeline designer's goal is to balance the length of each pipeline stage, 
just as the designer of the assembly line tries to balance the time for each step in 
the process. If the stages are perfectly balanced, then the time per instruction on 
the pipelined processor—assuming ideal conditions—is equal to 


$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$





Under these conditions, the speedup from pipelining equals the number of pipe 
stages, just as an assembly line with n stages can ideally produce cars n times as 
fast. Usually, however, the stages will not be perfectly balanced; furthermore, 
pipelining does involve some overhead. Thus, the time per instruction on the 
pipelined processor will not have its minimum possible value, yet it can be close. 
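
The gap between the ideal and the achieved speedup can be illustrated with a short sketch. It assumes that the unpipelined machine takes the sum of the stage times per instruction, that the pipelined clock cycle is set by the slowest stage plus a fixed per-stage overhead, and that one instruction completes every cycle (no stalls). The function names and the stage times in the note after the code are illustrative only.

```c
#include <stddef.h>

/* Clock cycle of the pipeline: the slowest stage plus a fixed
   per-stage overhead. */
double pipelined_cycle(const double *stage_ns, size_t nstages,
                       double overhead_ns)
{
    double slowest = 0.0;
    for (size_t i = 0; i < nstages; i++) {
        if (stage_ns[i] > slowest)
            slowest = stage_ns[i];
    }
    return slowest + overhead_ns;
}

/* Speedup over an unpipelined machine whose time per instruction is
   the sum of all stage times, assuming no pipeline stalls. */
double pipeline_speedup(const double *stage_ns, size_t nstages,
                        double overhead_ns)
{
    double unpipelined = 0.0;
    for (size_t i = 0; i < nstages; i++)
        unpipelined += stage_ns[i];
    return unpipelined / pipelined_cycle(stage_ns, nstages, overhead_ns);
}
```

With five perfectly balanced 10 ns stages and no overhead, the speedup equals the number of stages, 5. With stage times of 10, 8, 10, 10, and 7 ns and 1 ns of overhead, the cycle becomes 11 ns and the speedup drops to 45/11, or about 4.1.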

Pipelining yields a reduction in the average execution time per instruction. 
Depending on what you consider as the baseline, the reduction can be viewed as 
decreasing the number of clock cycles per instruction (CPI), as decreasing the 
clock cycle time, or as a combination. If the starting point is a processor that 
takes multiple clock cycles per instruction, then pipelining is usually viewed as 
reducing the CPI. This is the primary view we will take. If the starting point is a 
processor that takes 1 (long) clock cycle per instruction, then pipelining 
decreases the clock cycle time. 

Pipelining is an implementation technique that exploits parallelism among 
the instructions in a sequential instruction stream. It has the substantial advantage 
that, unlike some speedup techniques (see Chapter 4), it is not visible to the pro- 
grammer. In this appendix we will first cover the concept of pipelining using a 
classic five-stage pipeline; other chapters investigate the more sophisticated 
pipelining techniques in use in modern processors. Before we say more about 
pipelining and its use in a processor, we need a simple instruction set, which we 
introduce next. 


The Basics of a RISC Instruction Set 


Throughout this book we use a RISC (reduced instruction set computer) architec- 
ture or load-store architecture to illustrate the basic concepts, although nearly all 
the ideas we introduce in this book are applicable to other processors. In this sec- 
tion we introduce the core of a typical RISC architecture. In this appendix, and 
throughout the book, our default RISC architecture is MIPS. In many places, the 
concepts are sufficiently similar that they will apply to any RISC. RISC 
architectures are characterized by a few key properties, which dramatically 
simplify their implementation: 


■ All operations on data apply to data in registers and typically change the 
entire register (32 or 64 bits per register). 

■ The only operations that affect memory are load and store operations that 
move data from memory to a register or to memory from a register, 
respectively. Load and store operations that load or store less than a full register 
(e.g., a byte, 16 bits, or 32 bits) are often available. 

■ The instruction formats are few in number, with all instructions typically 
being one size. 


These simple properties lead to dramatic simplifications in the implementation of 
pipelining, which is why these instruction sets were designed this way. 

For consistency with the rest of the text, we use MIPS64, the 64-bit version 
of the MIPS instruction set. The extended 64-bit instructions are generally desig- 
nated by having a D on the start or end of the mnemonic. For example DADD is the 
64-bit version of an add instruction, while LD is the 64-bit version of a load 
instruction. 

Like other RISC architectures, the MIPS instruction set provides 32 registers, 
although register 0 always has the value 0. Most RISC architectures, like MIPS, 
have three classes of instructions (see Appendix B for more detail): 


1. ALU instructions—These instructions take either two registers or a register 
and a sign-extended immediate (called ALU immediate instructions, they 
have a 16-bit offset in MIPS), operate on them, and store the result into a 
third register. Typical operations include add (DADD), subtract (DSUB), and log- 
ical operations (such as AND or OR), which do not differentiate between 32-bit 
and 64-bit versions. Immediate versions of these instructions use the same 
mnemonics with a suffix of I. In MIPS, there are both signed and unsigned 
forms of the arithmetic instructions; the unsigned forms, which do not gener- 
ate overflow exceptions—and thus are the same in 32-bit and 64-bit mode— 
have a U at the end (e.g., DADDU, DSUBU, DADDIU). 


2. Load and store instructions—These instructions take a register source, called 
the base register, and an immediate field (16-bit in MIPS), called the offset, as 
operands. The sum—called the effective address—of the contents of the base 
register and the sign-extended offset is used as a memory address. In the case 
of a load instruction, a second register operand acts as the destination for the 
data loaded from memory. In the case of a store, the second register operand 
is the source of the data that is stored into memory. The instructions load 
word (LD) and store word (SD) load or store the entire 64-bit register contents. 


3. Branches and jumps—Branches are conditional transfers of control. There 
are usually two ways of specifying the branch condition in RISC architec- 
tures: with a set of condition bits (sometimes called a condition code) or by a 
limited set of comparisons between a pair of registers or between a register 
and zero. MIPS uses the latter. For this appendix, we consider only compari- 
sons for equality between two registers. In all RISC architectures, the branch 
destination is obtained by adding a sign-extended offset (16 bits in MIPS) to 
the current PC. Unconditional jumps are provided in many RISC architec- 
tures, but we will not cover jumps in this appendix. 
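
The effective-address calculation described for loads and stores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `sign_extend16` and `effective_address` are ours, and we assume 64-bit addresses that wrap around on overflow.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumption: addresses are 64 bits wide


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Contents of the base register plus the sign-extended 16-bit offset."""
    return (base + sign_extend16(offset16)) & MASK64


# A positive offset of 12 and a negative offset of -4 (encoded as 0xFFFC).
print(hex(effective_address(0x1000, 12)))      # 0x100c
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc
```

The same sign-extension step applies to ALU immediates and branch offsets, since all three use a 16-bit field in MIPS.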


A Simple Implementation of a RISC Instruction Set 


To understand how a RISC instruction set can be implemented in a pipelined 
fashion, we need to understand how it is implemented without pipelining. This 
section shows a simple implementation where every instruction takes at most 5 
clock cycles. We will extend this basic implementation to a pipelined version, 
resulting in a much lower CPI. Our unpipelined implementation is not the most 
economical or the highest-performance implementation without pipelining. 
Instead, it is designed to lead naturally to a pipelined implementation. Imple- 
menting the instruction set requires the introduction of several temporary regis- 
ters that are not part of the architecture; these are introduced in this section to 
simplify pipelining. Our implementation will focus only on a pipeline for an inte- 
ger subset of a RISC architecture that consists of load-store word, branch, and 
integer ALU operations. 

Every instruction in this RISC subset can be implemented in at most 5 clock 
cycles. The 5 clock cycles are as follows. 


1. Instruction fetch cycle (IF): 


Send the program counter (PC) to memory and fetch the current instruction 
from memory. Update the PC to the next sequential PC by adding 4 (since 
each instruction is 4 bytes) to the PC. 


2. Instruction decode/register fetch cycle (ID): 


Decode the instruction and read the registers corresponding to register 
source specifiers from the register file. Do the equality test on the registers 
as they are read, for a possible branch. Sign-extend the offset field of the 
instruction in case it is needed. Compute the possible branch target address 
by adding the sign-extended offset to the incremented PC. In an aggressive 
implementation, which we explore later, the branch can be completed at the 
end of this stage, by storing the branch-target address into the PC, if the 
condition test yielded true. 

Decoding is done in parallel with reading registers, which is possible 
because the register specifiers are at a fixed location in a RISC architecture. 
This technique is known as fixed-field decoding. Note that we may read a 
register we don't use, which doesn't help but also doesn't hurt performance. 
(It does waste energy to read an unneeded register, and power-sensitive 
designs might avoid this.) Because the immediate portion of an instruction 
is also located in an identical place, the sign-extended immediate is also cal- 
culated during this cycle in case it is needed. 


3. Execution/effective address cycle (EX): 


The ALU operates on the operands prepared in the prior cycle, performing 
one of three functions depending on the instruction type. 


• Memory reference: The ALU adds the base register and the offset to form 
the effective address. 


• Register-Register ALU instruction: The ALU performs the operation 
specified by the ALU opcode on the values read from the register file. 


• Register-Immediate ALU instruction: The ALU performs the operation 
specified by the ALU opcode on the first value read from the register file 
and the sign-extended immediate. 


In a load-store architecture the effective address and execution cycles 
can be combined into a single clock cycle, since no instruction needs to 
simultaneously calculate a data address and perform an operation on the 
data. 


4. Memory access (MEM): 


If the instruction is a load, memory does a read using the effective address 
computed in the previous cycle. If it is a store, then the memory writes the 
data from the second register read from the register file using the effective 
address. 


5. Write-back cycle (WB): 


• Register-Register ALU instruction or Load instruction: Write the result 
into the register file, whether it comes from the memory system (for a 
load) or from the ALU (for an ALU instruction). 


In this implementation, branch instructions require 2 cycles, store instructions 
require 4 cycles, and all other instructions require 5 cycles. Assuming a branch 
frequency of 12% and a store frequency of 10%, a typical instruction distribution 
leads to an overall CPI of 4.54. This implementation, however, is not optimal 
either in achieving the best performance or in using the minimal amount of hard- 
ware given the performance level; we leave the improvement of this design as an 
exercise for you and instead focus on pipelining this version. 
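
The overall CPI quoted above is a frequency-weighted average of the per-class cycle counts. The following sketch reproduces the arithmetic; the function name `average_cpi` is ours, and the 78% figure for "all other instructions" is simply what remains after the stated branch and store frequencies.

```python
def average_cpi(mix):
    """mix is a list of (frequency, cycles) pairs whose frequencies sum to 1."""
    total = sum(freq for freq, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


mix = [
    (0.12, 2),  # branches
    (0.10, 4),  # stores
    (0.78, 5),  # everything else (assumed to be the remainder)
]
print(round(average_cpi(mix), 2))  # 4.54
```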


The Classic Five-Stage Pipeline for a RISC Processor 


We can pipeline the execution described above with almost no changes by simply 
starting a new instruction on each clock cycle. (See why we chose this design!) 
Each of the clock cycles from the previous section becomes a pipe stage (a cycle 
in the pipeline). This results in the execution pattern shown in Figure A.1, which 
is the typical way a pipeline structure is drawn. Although each instruction takes 5 
clock cycles to complete, during each clock cycle the hardware will initiate a new 
instruction and will be executing some part of the five different instructions. 
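
The overlapped pattern can be generated mechanically, which makes the steady state easy to see. The sketch below is an idealized model that ignores hazards and stalls; the helper name `pipeline_rows` is ours.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(n_instructions):
    """Row k gives the stage that instruction i + k occupies in each clock cycle.

    An empty string means the instruction is not in the pipeline that cycle.
    Assumes one instruction starts per cycle and nothing ever stalls.
    """
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for k in range(n_instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[k + s] = stage  # instruction k reaches stage s in cycle k + s + 1
        rows.append(row)
    return rows


for k, row in enumerate(pipeline_rows(5)):
    print(f"i+{k}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading down the fifth column of the output shows five different instructions, one in each stage, during clock cycle 5.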

You may find it hard to believe that pipelining is as simple as this; it's not. In 
this and the following sections, we will make our RISC pipeline "real" by dealing 
with problems that pipelining introduces. 

To start with, we have to determine what happens on every clock cycle of the 
processor and make sure we don't try to perform two different operations with 
the same data path resource on the same clock cycle. For example, a single ALU 
cannot be asked to compute an effective address and perform a subtract operation 
at the same time. Thus, we must ensure that the overlap of instructions in the 
pipeline cannot cause such a conflict. Fortunately, the simplicity of a RISC 
instruction set makes resource evaluation relatively easy. Figure A.2 shows a 
simplified version of a RISC data path drawn in pipeline fashion. As you can see, 
the major functional units are used in different cycles, and hence overlapping the 
execution of multiple instructions introduces relatively few conflicts. There are 
three observations on which this fact rests. 

First, we use separate instruction and data memories, which we would typi- 
cally implement with separate instruction and data caches (discussed in Chapter 
5). The use of separate caches eliminates a conflict for a single memory that 
would arise between instruction fetch and data memory access. Notice that if our 
pipelined processor has a clock cycle that is equal to that of the unpipelined ver- 
sion, the memory system must deliver five times the bandwidth. This increased 
demand is one cost of higher performance. 

Second, the register file is used in two stages: one for reading in ID and 
one for writing in WB. These uses are distinct, so we simply show the register file 
in two places. Hence, we need to perform two reads and one write every clock 
cycle. To handle reads and a write to the same register (and for another reason, 


                                   Clock number 
Instruction number    1    2    3    4    5    6    7    8    9 
Instruction i         IF   ID   EX   MEM  WB 
Instruction i + 1          IF   ID   EX   MEM  WB 
Instruction i + 2               IF   ID   EX   MEM  WB 
Instruction i + 3                    IF   ID   EX   MEM  WB 
Instruction i + 4                         IF   ID   EX   MEM  WB 





Figure A.1 Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins its 5-cycle 
execution. If an instruction is started every clock cycle, the performance will be up to five times that of a processor 
that is not pipelined. The names for the stages in the pipeline are the same as those used for the cycles in the 
unpipelined implementation: IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, 
and WB = write back. 
[Figure A.2 diagram: program execution order (in instructions) on the vertical axis; time in clock cycles, CC 1 through CC 9, on the horizontal axis.] 


Figure A.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among 
the parts of the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is 
used as a source in the ID stage and as a destination in the WB stage, it appears twice. We show that it is read in one 
part of the stage and written in another by using a solid line, on the right or left, respectively, and a dashed line on 
the other side. The abbreviation IM is used for instruction memory, DM for data memory, and CC for clock cycle. 


which will become obvious shortly), we perform the register write in the first half 
of the clock cycle and the read in the second half. 

Third, Figure A.2 does not deal with the PC. To start a new instruction every 
clock, we must increment and store the PC every clock, and this must be done 
during the IF stage in preparation for the next instruction. Furthermore, we must 
also have an adder to compute the potential branch target during ID. One further 
problem is that a branch does not change the PC until the ID stage. This causes a 
problem, which we ignore for now, but will handle shortly. 

Although it is critical to ensure that instructions in the pipeline do not attempt 
to use the hardware resources at the same time, we must also ensure that instruc- 
tions in different stages of the pipeline do not interfere with one another. This 
separation is done by introducing pipeline registers between successive stages of 
the pipeline, so that at the end of a clock cycle all the results from a given stage 
are stored into a register that is used as the input to the next stage on the next 
clock cycle. Figure A.3 shows the pipeline drawn with these pipeline registers. 
[Figure A.3 diagram: time in clock cycles, CC 1 through CC 6, on the horizontal axis.] 











Figure A.3 A pipeline showing the pipeline registers between successive pipeline stages. Notice that the 
registers prevent interference between two different instructions in adjacent stages in the pipeline. The registers also 
play the critical role of carrying data for a given instruction from one stage to the other. The edge-triggered property 
of registers (that is, that the values change instantaneously on a clock edge) is critical. Otherwise, the data from one 
instruction could interfere with the execution of another! 


Although many figures will omit such registers for simplicity, they are 
required to make the pipeline operate properly and must be present. Of course, 
similar registers would be needed even in a multicycle data path that had no pipe- 
lining (since only values in registers are preserved across clock boundaries). In 
the case of a pipelined processor, the pipeline registers also play the key role of 
carrying intermediate results from one stage to another where the source and des- 
tination may not be directly adjacent. For example, the register value to be stored 
during a store instruction is read during ID, but not actually used until MEM; it is 
passed through two pipeline registers to reach the data memory during the MEM 
stage. Likewise, the result of an ALU instruction is computed during EX, but not 
actually stored until WB; it arrives there by passing through two pipeline regis- 
ters. It is sometimes useful to name the pipeline registers, and we follow the 
convention of naming them by the pipeline stages they connect, so that the regis- 
ters are called IF/ID, ID/EX, EX/MEM, and MEM/WB. 


Basic Performance Issues in Pipelining 


Pipelining increases the CPU instruction throughput—the number of instructions 
completed per unit of time—but it does not reduce the execution time of an indi- 
vidual instruction. In fact, it usually slightly increases the execution time of each 
instruction due to overhead in the control of the pipeline. The increase in instruc- 
tion throughput means that a program runs faster and has lower total execution 
time, even though no single instruction runs faster! 

The fact that the execution time of each instruction does not decrease puts 
limits on the practical depth of a pipeline, as we will see in the next section. In 
addition to limitations arising from pipeline latency, limits arise from imbalance 
among the pipe stages and from pipelining overhead. Imbalance among the pipe 
stages reduces performance since the clock can run no faster than the time needed 
for the slowest pipeline stage. Pipeline overhead arises from the combination of 
pipeline register delay and clock skew. The pipeline registers add setup time, 
which is the time that a register input must be stable before the clock signal that 
triggers a write occurs, plus propagation delay to the clock cycle. Clock skew, 
which is the maximum delay between when the clock arrives at any two registers, 
also contributes to the lower limit on the clock cycle. Once the clock cycle is as 
small as the sum of the clock skew and latch overhead, no further pipelining is 
useful, since there is no time left in the cycle for useful work. The interested 
reader should see Kunkel and Smith [1986]. As we will see in Chapter 2, this 
overhead affected the performance gains achieved by the Pentium 4 versus the 
Pentium III. 


Example Consider the unpipelined processor in the previous section. Assume that it has a 1 
ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 
cycles for memory operations. Assume that the relative frequencies of these oper- 
ations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and 
setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any 
latency impact, how much speedup in the instruction execution rate will we gain 
from a pipeline? 


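Answer The average instruction execution time on the unpipelined processor is 

$$
\begin{aligned}
\text{Average instruction execution time} &= \text{Clock cycle} \times \text{Average CPI} \\
&= 1\ \text{ns} \times \bigl((40\% + 20\%) \times 4 + 40\% \times 5\bigr) \\
&= 1\ \text{ns} \times 4.4 \\
&= 4.4\ \text{ns}
\end{aligned}
$$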


In the pipelined implementation, the clock must run at the speed of the slowest 
stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this is the average instruction 
execution time. Thus, the speedup from pipelining is 
$$
\begin{aligned}
\text{Speedup from pipelining} &= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} \\
&= \frac{4.4\ \text{ns}}{1.2\ \text{ns}} = 3.7\ \text{times}
\end{aligned}
$$


The 0.2 ns overhead essentially establishes a limit on the effectiveness of pipelin- 
ing. If the overhead is not affected by changes in the clock cycle, Amdahl's Law 
tells us that the overhead limits the speedup. 
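
The arithmetic in this example is easy to check with a short script. This is a sketch using the numbers given in the example; the function name `avg_instr_time_ns` is ours.

```python
def avg_instr_time_ns(clock_ns, mix):
    """Average instruction time = clock cycle x average CPI.

    mix is a list of (frequency, cycles) pairs.
    """
    return clock_ns * sum(freq * cycles for freq, cycles in mix)


# Unpipelined: 1 ns clock; ALU ops 40% x 4 cycles, branches 20% x 4, memory ops 40% x 5.
unpipelined = avg_instr_time_ns(1.0, [(0.40, 4), (0.20, 4), (0.40, 5)])

# Pipelined: one instruction per clock, with the clock stretched by 0.2 ns of overhead.
pipelined = 1.0 + 0.2

speedup = unpipelined / pipelined
print(f"{unpipelined:.1f} ns unpipelined, {pipelined:.1f} ns pipelined, speedup {speedup:.1f}")
```

Without the 0.2 ns overhead the same calculation gives a speedup of 4.4, so the overhead alone costs about one-sixth of the potential gain here.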


This simple RISC pipeline would function just fine for integer instructions if 
every instruction were independent of every other instruction in the pipeline. In 
reality, instructions in the pipeline can depend on one another; this is the topic of 
the next section. 


The Major Hurdle of Pipelining—Pipeline Hazards 


There are situations, called hazards, that prevent the next instruction in the 
instruction stream from executing during its designated clock cycle. Hazards 
reduce the performance from the ideal speedup gained by pipelining. There are 
three classes of hazards: 


1. Structural hazards arise from resource conflicts when the hardware cannot 
support all possible combinations of instructions simultaneously in over- 
lapped execution. 


2. Data hazards arise when an instruction depends on the results of a previous 
instruction in a way that is exposed by the overlapping of instructions in the 
pipeline. 

3. Control hazards arise from the pipelining of branches and other instructions 
that change the PC. 


Hazards in pipelines can make it necessary to stall the pipeline. Avoiding a 
hazard often requires that some instructions in the pipeline be allowed to proceed 
while others are delayed. For the pipelines we discuss in this appendix, when an 
instruction is stalled, all instructions issued later than the stalled instruction—and 
hence not as far along in the pipeline—are also stalled. Instructions issued earlier 
than the stalled instruction—and hence farther along in the pipeline—must con- 
tinue, since otherwise the hazard will never clear. As a result, no new instructions 
are fetched during the stall. We will see several examples of how pipeline stalls 
operate in this section—don't worry, they aren't as complex as they might sound! 


Performance of Pipelines with Stalls 


A stall causes the pipeline performance to degrade from the ideal performance. 
Let's look at a simple equation for finding the actual speedup from pipelining 
starting with the formula from the previous section. 
$$
\begin{aligned}
\text{Speedup from pipelining} &= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} \\
&= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} \\
&= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\end{aligned}
$$





Pipelining can be thought of as decreasing the CPI or the clock cycle time. Since 
it is traditional to use the CPI to compare pipelines, let's start with that assump- 
tion. The ideal CPI on a pipelined processor is almost always 1. Hence, we can 
compute the pipelined CPI: 


$$
\begin{aligned}
\text{CPI pipelined} &= \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} \\
&= 1 + \text{Pipeline stall clock cycles per instruction}
\end{aligned}
$$


If we ignore the cycle time overhead of pipelining and assume the stages are per- 
fectly balanced, then the cycle time of the two processors can be equal, leading to 


$$
\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}
$$


One important simple case is where all instructions take the same number of 
cycles, which must also equal the number of pipeline stages (also called the depth 
of the pipeline). In this case, the unpipelined CPI is equal to the depth of the pipe- 
line, leading to 


$$
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
$$





If there are no pipeline stalls, this leads to the intuitive result that pipelining can 
improve performance by the depth of the pipeline. 

Alternatively, if we think of pipelining as improving the clock cycle time, 
then we can assume that the CPI of the unpipelined processor, as well as that of 
the pipelined processor, is 1. This leads to 


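$$
\begin{aligned}
\text{Speedup from pipelining} &= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} \\
&= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\end{aligned}
$$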





In cases where the pipe stages are perfectly balanced and there is no overhead, 
the clock cycle on the pipelined processor is smaller than the clock cycle of the 
unpipelined processor by a factor equal to the pipelined depth: 


$$
\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}}
$$

$$
\text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
$$
This leads to the following: 

$$
\begin{aligned}
\text{Speedup from pipelining} &= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} \\
&= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}
\end{aligned}
$$


Thus, if there are no stalls, the speedup is equal to the number of pipeline stages, 
matching our intuition for the ideal case. 
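
The final formula is simple enough to evaluate directly. The sketch below assumes perfectly balanced stages and no clock overhead, as in the derivation; the function name `speedup_balanced` and the sample stall rates are ours.

```python
def speedup_balanced(depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Valid only under the assumptions of the derivation: balanced stages,
    no pipelining overhead, and an ideal pipelined CPI of 1.
    """
    if stall_cycles_per_instr < 0:
        raise ValueError("stall cycles per instruction cannot be negative")
    return depth / (1 + stall_cycles_per_instr)


for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instr = {stalls:4.2f} -> speedup = {speedup_balanced(5, stalls):.2f}")
```

With a five-stage pipeline, an average of one stall cycle per instruction halves the ideal speedup, from 5 to 2.5.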


Structural Hazards 


When a processor is pipelined, the overlapped execution of instructions requires 
pipelining of functional units and duplication of resources to allow all possible 
combinations of instructions in the pipeline. If some combination of instructions 
cannot be accommodated because of resource conflicts, the processor is said to 
have a structural hazard. 

The most common instances of structural hazards arise when some functional 
unit is not fully pipelined. Then a sequence of instructions using that unpipelined 
unit cannot proceed at the rate of one per clock cycle. Another common way that 
structural hazards appear is when some resource has not been duplicated enough 
to allow all combinations of instructions in the pipeline to execute. For example, 
a processor may have only one register-file write port, but under certain circum- 
stances, the pipeline might want to perform two writes in a clock cycle. This will 
generate a structural hazard. 

When a sequence of instructions encounters this hazard, the pipeline will stall 
one of the instructions until the required unit is available. Such stalls will increase 
the CPI from its usual ideal value of 1. 

Some pipelined processors have shared a single-memory pipeline for data 
and instructions. As a result, when an instruction contains a data memory refer- 
ence, it will conflict with the instruction reference for a later instruction, as 
shown in Figure A.4. To resolve this hazard, we stall the pipeline for 1 clock 
cycle when the data memory access occurs. A stall is commonly called a pipeline 
bubble or just bubble, since it floats through the pipeline taking space but 
carrying no useful work. We will see another type of stall when we talk about 
data hazards. 

Designers often indicate stall behavior using a simple diagram with only the 
pipe stage names, as in Figure A.5. The form of Figure A.5 shows the stall by 
indicating the cycle when no action occurs and simply shifting instruction 3 to 
the right (which delays its execution start and finish by 1 cycle). The effect of the 
pipeline bubble is actually to occupy the resources for that instruction slot as it 
travels through the pipeline. 


Example Let's see how much the load structural hazard might cost. Suppose that data 
references constitute 40% of the mix, and that the ideal CPI of the pipelined processor, 
ignoring the structural hazard, is 1. Assume that the processor with the 
structural hazard has a clock rate that is 1.05 times higher than the clock rate of 


[Figure A.4 diagram: time (in clock cycles) on the horizontal axis; one row for the load and one row for each of the instructions that follow it.] 





Figure A.4 A processor with only one memory port will generate a conflict whenever a memory reference 
occurs. In this example the load instruction uses the memory for a data access at the same time instruction 3 wants 
to fetch an instruction from memory. 


the processor without the hazard. Disregarding any other performance losses, is 
the pipeline with or without the structural hazard faster, and by how much? 


Answer There are several ways we could solve this problem. Perhaps the simplest is to 
compute the average instruction time on the two processors: 


Average instruction time = CPI x Clock cycle time 


Since it has no stalls, the average instruction time for the ideal processor is simply 
the $\text{Clock cycle time}_{\text{ideal}}$. The average instruction time for the processor with 
the structural hazard is 

$$
\begin{aligned}
\text{Average instruction time} &= \text{CPI} \times \text{Clock cycle time} \\
&= (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.05} \\
&= 1.3 \times \text{Clock cycle time}_{\text{ideal}}
\end{aligned}
$$







                                       Clock cycle number 
Instruction           1    2    3    4      5    6    7    8    9    10 
Load instruction      IF   ID   EX   MEM    WB 
Instruction i + 1          IF   ID   EX     MEM  WB 
Instruction i + 2               IF   ID     EX   MEM  WB 
Instruction i + 3                    stall  IF   ID   EX   MEM  WB 
Instruction i + 4                           IF   ID   EX   MEM  WB 
Instruction i + 5                                IF   ID   EX   MEM 
Instruction i + 6                                     IF   ID   EX 





Figure A.5 A pipeline stalled for a structural hazard: a load with one memory port. As shown here, the load 
instruction effectively steals an instruction-fetch cycle, causing the pipeline to stall: no instruction is initiated on 
clock cycle 4 (which normally would initiate instruction i + 3). Because the instruction being fetched is stalled, all 
other instructions in the pipeline before the stalled instruction can proceed normally. The stall cycle will continue to 
pass through the pipeline, so that no instruction completes on clock cycle 8. Sometimes these pipeline diagrams are 
drawn with the stall occupying an entire horizontal row and instruction 3 being moved to the next row; in either 
case, the effect is the same, since instruction i + 3 does not begin execution until cycle 5. We use the form above, 
since it takes less space in the figure. Note that this figure assumes that instructions i + 1 and i + 2 are not memory 
references. 


Clearly, the processor without the structural hazard is faster; we can use the ratio 
of the average instruction times to conclude that the processor without the hazard 
is 1.3 times faster. 
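
The comparison can be checked numerically. In this sketch the ideal clock cycle is set to an arbitrary 1.0 time unit, since only the ratio matters; the variable names are ours.

```python
def avg_instr_time(cpi, clock_cycle):
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle


ideal_cycle = 1.0  # arbitrary time unit (assumption); only ratios matter below

# Processor without the hazard: CPI of 1 at the ideal clock cycle.
t_ideal = avg_instr_time(1.0, ideal_cycle)

# Processor with the hazard: 40% of instructions stall 1 cycle,
# but the clock rate is 1.05 times higher (cycle time is 1/1.05 of ideal).
t_hazard = avg_instr_time(1.0 + 0.40 * 1, ideal_cycle / 1.05)

ratio = t_hazard / t_ideal
print(f"hazard / ideal = {ratio:.2f}")  # about 1.33, which the text rounds to 1.3
```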

As an alternative to this structural hazard, the designer could provide a sepa- 
rate memory access for instructions, either by splitting the cache into separate 
instruction and data caches, or by using a set of buffers, usually called instruction 
buffers, to hold instructions. Chapter 5 discusses both the split cache and instruc- 
tion buffer ideas. 


If all other factors are equal, a processor without structural hazards will 
always have a lower CPI. Why, then, would a designer allow structural hazards? 
The primary reason is to reduce cost of the unit, since pipelining all the func- 
tional units, or duplicating them, may be too costly. For example, processors that 
support both an instruction and a data cache access every cycle (to prevent the 
structural hazard of the above example) require twice as much total memory 
bandwidth and often have higher bandwidth at the pins. Likewise, fully pipelin- 
ing a floating-point multiplier consumes lots of gates. If the structural hazard is 
rare, it may not be worth the cost to avoid it. 


Data Hazards 


A major effect of pipelining is to change the relative timing of instructions by 
overlapping their execution. This overlap introduces data and control hazards. 
Data hazards occur when the pipeline changes the order of read/write accesses to 
operands so that the order differs from the order seen by sequentially executing 
instructions on an unpipelined processor. Consider the pipelined execution of 
these instructions: 


DADD R1,R2,R3 
DSUB R4,R1,R5 
AND  R6,R1,R7 
OR   R8,R1,R9 
XOR  R10,R1,R11 


All the instructions after the DADD use the result of the DADD instruction. As shown 
in Figure A.6, the DADD instruction writes the value of R1 in the WB pipe stage, 
but the DSUB instruction reads the value during its ID stage. This problem is 
called a data hazard. Unless precautions are taken to prevent it, the DSUB 
instruction will read the wrong value and try to use it. In fact, the value used by the DSUB 
instruction is not even deterministic: Though we might think it logical to assume 
that DSUB would always use the value of R1 that was assigned by an instruction 
prior to DADD, this is not always the case. If an interrupt should occur between the 
DADD and DSUB instructions, the WB stage of the DADD will complete, and the 
value of R1 at that point will be the result of the DADD. This unpredictable 
behavior is obviously unacceptable. 

Figure A.6 The use of the result of the DADD instruction in the next three instructions causes a hazard, since the 
register is not written until after those instructions read it. 

The AND instruction is also affected by this hazard. As we can see from 
Figure A.6, the write of R1 does not complete until the end of clock cycle 5. 
Thus, the AND instruction that reads the registers during clock cycle 4 will receive 
the wrong results. 

The XOR instruction operates properly because its register read occurs in 
clock cycle 6, after the register write. The OR instruction also operates without 
incurring a hazard because we perform the register file reads in the second half of 
the cycle and the writes in the first half. 
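
The timing argument above can be sketched as a small check over the instruction sequence. This is a simplified model under stated assumptions: one instruction issues per clock cycle, there is no forwarding and no stalling, registers are read in ID (stage 2) and written in WB (stage 5), and a write in the first half of a cycle is visible to a read in the second half of the same cycle. The function name `stale_readers` and the tuple layout are ours, and the model ignores the case where an intervening instruction overwrites the same register.

```python
ID_STAGE, WB_STAGE = 2, 5  # 1-based positions of ID and WB in IF, ID, EX, MEM, WB


def stale_readers(program):
    """Return (producer, consumer) pairs where the consumer reads too early.

    program is a list of (name, dest, sources) tuples in issue order.
    """
    pairs = []
    for p, (pname, dest, _) in enumerate(program):
        if dest is None:
            continue
        write_cycle = p + WB_STAGE          # cycle in which the producer writes
        for c in range(p + 1, len(program)):
            cname, _, sources = program[c]
            read_cycle = c + ID_STAGE       # cycle in which the consumer reads
            # Same-cycle read is safe: write in first half, read in second half.
            if dest in sources and read_cycle < write_cycle:
                pairs.append((pname, cname))
    return pairs


program = [
    ("DADD", "R1", ("R2", "R3")),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]
print(stale_readers(program))  # [('DADD', 'DSUB'), ('DADD', 'AND')]
```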

The next subsection discusses a technique to eliminate the stalls for the haz- 
ard involving the DSUB and AND instructions. 


Minimizing Data Hazard Stalls by Forwarding 


The problem posed in Figure A.6 can be solved with a simple hardware tech- 
nique called forwarding (also called bypassing and sometimes short-circuiting). 
The key insight in forwarding is that the result is not really needed by the DSUB 
until after the DADD actually produces it. If the result can be moved from the pipeline 
register where the DADD stores it to where the DSUB needs it, then the need for 
a stall can be avoided. Using this observation, forwarding works as follows: 


1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is 
always fed back to the ALU inputs. 


2. If the forwarding hardware detects that the previous ALU operation has written 
the register corresponding to a source for the current ALU operation, control 
logic selects the forwarded result as the ALU input rather than the value 
read from the register file. 


Notice that with forwarding, if the DSUB is stalled, the DADD will be completed 
and the bypass will not be activated. This relationship is also true for the case of 
an interrupt between the two instructions. 

As the example in Figure A.6 shows, we need to forward results not only 
from the immediately previous instruction, but possibly from an instruction that 
started 2 cycles earlier. Figure A.7 shows our example with the bypass paths in 
place and highlighting the timing of the register read and writes. This code 
sequence can be executed without stalls. 
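
The selection step that forwarding performs at each ALU input can be sketched as follows. This is a behavioral illustration, not a description of any particular implementation: the function name `select_alu_input` and the dictionary fields `rd` and `value` are hypothetical, and we assume that when both pipeline registers hold a result for the same register, the more recent one (in EX/MEM) wins. Register 0 is never forwarded, since it always has the value 0.

```python
def select_alu_input(src_reg, regfile_value, ex_mem=None, mem_wb=None):
    """Choose the value for one ALU input and report where it came from.

    ex_mem and mem_wb describe ALU results held in the EX/MEM and MEM/WB
    pipeline registers as {"rd": register number, "value": result}, or None
    if the instruction in that position writes no register.
    """
    if src_reg != 0:
        if ex_mem is not None and ex_mem["rd"] == src_reg:
            return ex_mem["value"], "EX/MEM"    # result from the previous instruction
        if mem_wb is not None and mem_wb["rd"] == src_reg:
            return mem_wb["value"], "MEM/WB"    # result from two instructions back
    return regfile_value, "register file"


# DSUB R4,R1,R5 in EX while DADD R1,R2,R3 sits in EX/MEM with result 7;
# the register file still holds a stale 99 for R1.
print(select_alu_input(1, 99, ex_mem={"rd": 1, "value": 7}))
# AND R6,R1,R7 one cycle later: the DADD result has moved on to MEM/WB.
print(select_alu_input(1, 99, mem_wb={"rd": 1, "value": 7}))
# A source that neither earlier instruction wrote comes from the register file.
print(select_alu_input(5, 42, ex_mem={"rd": 1, "value": 7}))
```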

Figure A.7 A set of instructions that depends on the DADD result uses forwarding paths to avoid the data hazard. 
The inputs for the DSUB and AND instructions forward from the pipeline registers to the first ALU input. The OR receives 
its result by forwarding through the register file, which is easily accomplished by reading the registers in the second 
half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded 
result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline 
register or from different pipeline registers. This would occur, for example, if the AND instruction was AND R6,R1,R4. 


Forwarding can be generalized to include passing a result directly to the functional 
unit that requires it: A result is forwarded from the pipeline register corresponding 
to the output of one unit to the input of another, rather than just from 
the result of a unit to the input of the same unit. Take, for example, the following 
sequence: 

DADD R1,R2,R3 
LD   R4,0(R1) 
SD   R4,12(R1) 


To prevent a stall in this sequence, we would need to forward the values of the 
ALU output and memory unit output from the pipeline registers to the ALU and 
data memory inputs. Figure A.8 shows all the forwarding paths for this example. 
Figure A.8 Forwarding of operand required by stores during MEM. The result of the load is forwarded from the 
memory output to the memory input to be stored. In addition, the ALU output is forwarded to the ALU input for the 
address calculation of both the load and the store (this is no different than forwarding to another ALU operation). If 
the store depended on an immediately preceding ALU operation (not shown above), the result would need to be for- 
warded to prevent a stall. 


Data Hazards Requiring Stalls 


Unfortunately, not all potential data hazards can be handled by bypassing. 
Consider the following sequence of instructions: 


LD R1,0(R2)
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9


The pipelined data path with the bypass paths for this example is shown in 
Figure A.9. This case is different from the situation with back-to-back ALU 
operations. The LD instruction does not have the data until the end of clock cycle 
4 (its MEM cycle), while the DSUB instruction needs to have the data by the 
beginning of that clock cycle. Thus, the data hazard from using the result of a 
load instruction cannot be completely eliminated with simple hardware. As Fig- 
ure A.9 shows, such a forwarding path would have to operate backward in 
time—a capability not yet available to computer designers! We can forward the 
result immediately to the ALU from the pipeline registers for use in the AND oper- 
ation, which begins 2 clock cycles after the load. Likewise, the OR instruction has 
no problem, since it receives the value through the register file. For the DSUB
instruction, the forwarded result arrives too late: at the end of a clock cycle,
when it is needed at the beginning.

Figure A.9 The load instruction can bypass its results to the AND and OR instructions, but not to the DSUB, since
that would mean forwarding the result in "negative time."

The load instruction has a delay or latency that cannot be eliminated by for- 
warding alone. Instead, we need to add hardware, called a pipeline interlock, to 
preserve the correct execution pattern. In general, a pipeline interlock detects a 
hazard and stalls the pipeline until the hazard is cleared. In this case, the interlock 
stalls the pipeline, beginning with the instruction that wants to use the data until 
the source instruction produces it. This pipeline interlock introduces a stall or 
bubble, just as it did for the structural hazard. The CPI for the stalled instruction 
increases by the length of the stall (1 clock cycle in this case). 
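
The effect of the load interlock on timing can be sketched as follows. This is a minimal model, assuming full forwarding so that the only stall is a load immediately followed by an instruction that reads the loaded register; the `Instr` record and the function names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str                       # mnemonic, e.g. "LD" or "DSUB"
    dest: Optional[str]           # register written, or None
    srcs: Tuple[str, ...] = ()    # registers read


def ex_cycles(program: List[Instr]) -> List[int]:
    """Clock cycle in which each instruction occupies EX.

    The first instruction reaches EX in cycle 3 (IF=1, ID=2, EX=3).
    Each later instruction follows one cycle behind its predecessor,
    plus one bubble if the predecessor is a load whose result it reads.
    """
    cycles: List[int] = []
    for i, ins in enumerate(program):
        cycle = 3 if i == 0 else cycles[-1] + 1
        if i > 0:
            prev = program[i - 1]
            if prev.op == "LD" and prev.dest in ins.srcs:
                cycle += 1  # pipeline interlock inserts a 1-cycle stall
        cycles.append(cycle)
    return cycles


def total_cycles(program: List[Instr]) -> int:
    """Cycle in which the last instruction completes WB (EX, then MEM, then WB)."""
    return ex_cycles(program)[-1] + 2
```

For the LD, DSUB, AND, OR sequence above, this model places the EX stages in cycles 3, 5, 6, and 7: only the DSUB is delayed directly by the interlock, and the instructions behind it simply move one cycle later.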

Figure A.10 shows the pipeline before and after the stall using the names of the
pipeline stages. Because the stall causes the instructions starting with the DSUB to
move 1 cycle later in time, the forwarding to the AND instruction now goes
through the register file, and no forwarding at all is needed for the OR instruction.
The insertion of the bubble causes the number of cycles to complete this
sequence to grow by one. No instruction is started during clock cycle 4 (and none
finishes during cycle 6).

LD   R1,0(R2)   IF    ID    EX    MEM   WB
DSUB R4,R1,R5         IF    ID    EX    MEM   WB
AND  R6,R1,R7               IF    ID    EX    MEM   WB
OR   R8,R1,R9                     IF    ID    EX    MEM   WB

LD   R1,0(R2)   IF    ID    EX    MEM   WB
DSUB R4,R1,R5         IF    ID    stall EX    MEM   WB
AND  R6,R1,R7               IF    stall ID    EX    MEM   WB
OR   R8,R1,R9                     stall IF    ID    EX    MEM   WB

Figure A.10 In the top half, we can see why a stall is needed: The MEM cycle of the load produces a value that is
needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as
shown in the bottom half.


Branch Hazards 


Control hazards can cause a greater performance loss for our MIPS pipeline than 
do data hazards. When a branch is executed, it may or may not change the PC to 
something other than its current value plus 4. Recall that if a branch changes the 
PC to its target address, it is a taken branch; if it falls through, it is not taken, or 
untaken. If instruction i is a taken branch, then the PC is normally not changed 
until the end of ID, after the completion of the address calculation and com- 
parison. 

Figure A.11 shows that the simplest method of dealing with branches is to
redo the fetch of the instruction following a branch, once we detect the branch 
during ID (when instructions are decoded). The first IF cycle is essentially a stall, 
because it never performs useful work. You may have noticed that if the branch is 
untaken, then the repetition of the IF stage is unnecessary since the correct instruc- 
tion was indeed fetched. We will develop several schemes to take advantage of this 
fact shortly. 

One stall cycle for every branch will yield a performance loss of 10% to 30% 
depending on the branch frequency, so we will examine some techniques to deal 
with this loss. 














Branch instruction      IF    ID    EX    MEM   WB
Branch successor              IF    IF    ID    EX    MEM   WB
Branch successor + 1                      IF    ID    EX    MEM
Branch successor + 2                            IF    ID    EX

Figure A.11 A branch causes a one-cycle stall in the five-stage pipeline. The instruction
after the branch is fetched, but the instruction is ignored, and the fetch is restarted
once the branch target is known. It is probably obvious that if the branch is not taken,
the second IF for branch successor is redundant. This will be addressed shortly.


Reducing Pipeline Branch Penalties


There are many methods for dealing with the pipeline stalls caused by branch
delay; we discuss four simple compile-time schemes in this subsection. In these
four schemes the actions for a branch are static: they are fixed for each branch
during the entire execution. The software can try to minimize the branch penalty
using knowledge of the hardware scheme and of branch behavior. Chapters 2 and
3 look at more powerful hardware and software techniques for both static and
dynamic branch prediction.

The simplest scheme to handle branches is to freeze or flush the pipeline,
holding or deleting any instructions after the branch until the branch destination 
is known. The attractiveness of this solution lies primarily in its simplicity both 
for hardware and software. It is the solution used earlier in the pipeline shown in 
Figure A.11. In this case the branch penalty is fixed and cannot be reduced by 
software. 

A higher-performance, and only slightly more complex, scheme is to treat 
every branch as not taken, simply allowing the hardware to continue as if the 
branch were not executed. Here, care must be taken not to change the processor 
state until the branch outcome is definitely known. The complexity of this 
scheme arises from having to know when the state might be changed by an 
instruction and how to "back out" such a change. 

In the simple five-stage pipeline, this predicted-not-taken or predicted- 
untaken scheme is implemented by continuing to fetch instructions as if the 
branch were a normal instruction. The pipeline looks as if nothing out of the ordi- 
nary is happening. If the branch is taken, however, we need to turn the fetched 
instruction into a no-op and restart the fetch at the target address. Figure A.12
shows both situations. 



































Untaken branch instruction  IF    ID    EX    MEM   WB
Instruction i + 1                 IF    ID    EX    MEM   WB
Instruction i + 2                       IF    ID    EX    MEM   WB
Instruction i + 3                             IF    ID    EX    MEM   WB
Instruction i + 4                                   IF    ID    EX    MEM   WB

Taken branch instruction    IF    ID    EX    MEM   WB
Instruction i + 1                 IF    idle  idle  idle  idle
Branch target                           IF    ID    EX    MEM   WB
Branch target + 1                             IF    ID    EX    MEM   WB
Branch target + 2                                   IF    ID    EX    MEM   WB

Figure A.12 The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and
taken (bottom). When the branch is untaken, determined during ID, we have fetched the fall-through and just
continue. If the branch is taken during ID, we restart the fetch at the branch target. This causes all instructions following
the branch to stall 1 clock cycle.

An alternative scheme is to treat every branch as taken. As soon as the branch
is decoded and the target address is computed, we assume the branch to be taken
and begin fetching and executing at the target. Because in our five-stage pipeline
we don't know the target address any earlier than we know the branch outcome,
there is no advantage in this approach for this pipeline. In some processors,
especially those with implicitly set condition codes or more powerful (and hence
slower) branch conditions, the branch target is known before the branch outcome,
and a predicted-taken scheme might make sense. In either a predicted-taken
or predicted-not-taken scheme, the compiler can improve performance by
organizing the code so that the most frequent path matches the hardware's
choice. Our fourth scheme provides more opportunities for the compiler to
improve performance.

A fourth scheme in use in some processors is called delayed branch. This 
technique was heavily used in early RISC processors and works reasonably well 
in the five-stage pipeline. In a delayed branch, the execution cycle with a branch 
delay of one is 


branch instruction 
sequential successor 
branch target if taken 


The sequential successor is in the branch delay slot. This instruction is executed 
whether or not the branch is taken. The pipeline behavior of the five-stage pipeline
with a branch delay is shown in Figure A.13. Although it is possible to have
a branch delay longer than one, in practice, almost all processors with delayed 
branch have a single instruction delay; other techniques are used if the pipeline 
has a longer potential branch penalty. 






































Untaken branch instruction        IF    ID    EX    MEM   WB
Branch delay instruction (i + 1)        IF    ID    EX    MEM   WB
Instruction i + 2                             IF    ID    EX    MEM   WB
Instruction i + 3                                   IF    ID    EX    MEM   WB
Instruction i + 4                                         IF    ID    EX    MEM   WB

Taken branch instruction          IF    ID    EX    MEM   WB
Branch delay instruction (i + 1)        IF    ID    EX    MEM   WB
Branch target                                 IF    ID    EX    MEM   WB
Branch target + 1                                   IF    ID    EX    MEM   WB
Branch target + 2                                         IF    ID    EX    MEM   WB

Figure A.13 The behavior of a delayed branch is the same whether or not the branch is taken. The instructions in
the delay slot (there is only one delay slot for MIPS) are executed. If the branch is untaken, execution continues with
the instruction after the branch delay instruction; if the branch is taken, execution continues at the branch target.
When the instruction in the branch delay slot is also a branch, the meaning is unclear: If the branch is not taken, what
should happen to the branch in the branch delay slot? Because of this confusion, architectures with delayed branches
often disallow putting a branch in the delay slot.

The job of the compiler is to make the successor instructions valid and useful.
A number of optimizations are used. Figure A.14 shows the three ways in which
the branch delay can be scheduled.

The limitations on delayed-branch scheduling arise from (1) the restrictions on
the instructions that are scheduled into the delay slots and (2) our ability to predict
at compile time whether a branch is likely to be taken or not. To improve the ability
of the compiler to fill branch delay slots, most processors with conditional branches
have introduced a canceling or nullifying branch. In a canceling branch, the instruction
includes the direction that the branch was predicted. When the branch behaves
as predicted, the instruction in the branch delay slot is simply executed as it would
normally be with a delayed branch. When the branch is incorrectly predicted, the
instruction in the branch delay slot is simply turned into a no-op.

(a) From before             (b) From target             (c) From fall-through

DADD R1,R2,R3               DSUB R4,R5,R6               DADD R1,R2,R3
if R2 = 0 then              DADD R1,R2,R3               if R1 = 0 then
    Delay slot              if R1 = 0 then                  Delay slot
                                Delay slot              OR   R7,R8,R9
                                                        DSUB R4,R5,R6

becomes                     becomes                     becomes

if R2 = 0 then              DSUB R4,R5,R6               DADD R1,R2,R3
    DADD R1,R2,R3           DADD R1,R2,R3               if R1 = 0 then
                            if R1 = 0 then                  OR   R7,R8,R9
                                DSUB R4,R5,R6           DSUB R4,R5,R6

Figure A.14 Scheduling the branch delay slot. The top box in each pair shows the
code before scheduling; the bottom box shows the scheduled code. In (a) the delay slot
is scheduled with an independent instruction from before the branch. This is the best
choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences
for (b) and (c), the use of R1 in the branch condition prevents the DADD instruction
(whose destination is R1) from being moved after the branch. In (b) the branch delay
slot is scheduled from the target of the branch; usually the target instruction will need
to be copied because it can be reached by another path. Strategy (b) is preferred when
the branch is taken with high probability, such as a loop branch. Finally, the branch may
be scheduled from the not-taken fall-through as in (c). To make this optimization legal
for (b) or (c), it must be OK to execute the moved instruction when the branch goes in
the unexpected direction. By OK we mean that the work is wasted, but the program will
still execute correctly. This is the case, for example, in (c) if R7 were an unused temporary
register when the branch goes in the unexpected direction.


Performance of Branch Schemes


What is the effective performance of each of these schemes? The effective pipeline
speedup with branch penalties, assuming an ideal CPI of 1, is

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles from branches}}$$

Because of the following:

$$\text{Pipeline stall cycles from branches} = \text{Branch frequency} \times \text{Branch penalty}$$

we obtain

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

The branch frequency and branch penalty can have a component from both
unconditional and conditional branches. However, the latter dominate since they
are more frequent.


Example

For a deeper pipeline, such as that in a MIPS R4000, it takes at least three pipeline
stages before the branch-target address is known and an additional cycle
before the branch condition is evaluated, assuming no stalls on the registers in the 
conditional comparison. A three-stage delay leads to the branch penalties for the 
three simplest prediction schemes listed in Figure A.15.

Find the effective addition to the CPI arising from branches for this pipeline, 
assuming the following frequencies: 


























Unconditional branch            4%
Conditional branch, untaken     6%
Conditional branch, taken      10%

Branch scheme        Penalty unconditional   Penalty untaken   Penalty taken
Flush pipeline                 2                    3                3
Predicted taken                2                    3                2
Predicted untaken              2                    0                3

Figure A.15 Branch penalties for the three simplest prediction schemes for a deeper pipeline.

                                Additions to the CPI from branch costs
                     Unconditional   Untaken conditional   Taken conditional
Branch scheme        branches        branches              branches            All branches
Frequency of event       4%               6%                   10%                 20%
Stall pipeline          0.08             0.18                  0.30                0.56
Predicted taken         0.08             0.18                  0.20                0.46
Predicted untaken       0.08             0.00                  0.30                0.38

Figure A.16 CPI penalties for three branch-prediction schemes and a deeper pipeline.


Answer 


We find the CPIs by multiplying the relative frequency of unconditional, condi- 
tional untaken, and conditional taken branches by the respective penalties. The 
results are shown in Figure A.16.

The differences among the schemes are substantially increased with this 
longer delay. If the base CPI were 1 and branches were the only source of stalls, 
the ideal pipeline would be 1.56 times faster than a pipeline that used the stall- 
pipeline scheme. The predicted-untaken scheme would be 1.13 times better than 
the stall-pipeline scheme under the same assumptions. 
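
The arithmetic behind Figure A.16 and the two ratios quoted above can be reproduced with a short Python sketch. The frequencies and penalties are the ones given in the example and in Figure A.15; the dictionary and function names are assumptions made for this illustration.

```python
# Branch frequencies from the example (fraction of all instructions).
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# Penalties in clock cycles, copied from Figure A.15.
PENALTY = {
    "Flush pipeline":    {"unconditional": 2, "untaken": 3, "taken": 3},
    "Predicted taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "Predicted untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def cpi_addition(scheme: str) -> float:
    """Stall cycles per instruction: sum of frequency x penalty over branch kinds."""
    return sum(FREQ[kind] * PENALTY[scheme][kind] for kind in FREQ)


def pipeline_speedup(depth: float, branch_stall_cpi: float) -> float:
    """Pipeline speedup = depth / (1 + stall cycles from branches), ideal CPI of 1."""
    return depth / (1 + branch_stall_cpi)


def relative_speed(faster_stalls: float, slower_stalls: float) -> float:
    """How many times faster one design is than another, base CPI of 1 for both."""
    return (1 + slower_stalls) / (1 + faster_stalls)
```

With these definitions, `relative_speed(0.0, cpi_addition("Flush pipeline"))` gives the factor for the ideal pipeline over the stall-pipeline scheme, and `relative_speed(cpi_addition("Predicted untaken"), cpi_addition("Flush pipeline"))` gives the factor for predicted-untaken over the stall-pipeline scheme. The pipeline depth cancels out of both ratios, which is why the example never needs it.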


How Is Pipelining Implemented? 


Before we proceed to basic pipelining, we need to review a simple implementa- 
tion of an unpipelined version of MIPS. 


A Simple Implementation of MIPS 


In this section we follow the style of Section A.1, showing first a simple unpipe- 
lined implementation and then the pipelined implementation. This time, however, 
our example is specific to the MIPS architecture. 

In this subsection we focus on a pipeline for an integer subset of MIPS that 
consists of load-store word, branch equal zero, and integer ALU operations. Later 
in this appendix, we will incorporate the basic floating-point operations. 
Although we discuss only a subset of MIPS, the basic principles can be extended 
to handle all the instructions. We initially used a less aggressive implementation 
of a branch instruction. We show how to implement the more aggressive version 
at the end of this section. 

Every MIPS instruction can be implemented in at most 5 clock cycles. The 5 
clock cycles are as follows. 


1. Instruction fetch cycle (IF):


IR ← Mem[PC];
NPC ← PC + 4;


Operation: Send out the PC and fetch the instruction from memory into the
instruction register (IR); increment the PC by 4 to address the next sequential
instruction. The IR is used to hold the instruction that will be needed on
subsequent clock cycles; likewise the register NPC is used to hold the next
sequential PC.


2. Instruction decode/register fetch cycle (ID):


A ← Regs[rs];
B ← Regs[rt];
Imm ← sign-extended immediate field of IR;


Operation: Decode the instruction and access the register file to read the
registers (rs and rt are the register specifiers). The outputs of the
general-purpose registers are read into two temporary registers (A and B) for use in
later clock cycles. The lower 16 bits of the IR are also sign extended and
stored into the temporary register Imm, for use in the next cycle.

Decoding is done in parallel with reading registers, which is possible
because these fields are at a fixed location in the MIPS instruction format
(see Figure B.22 on page B-35). Because the immediate portion of an
instruction is located in an identical place in every MIPS format, the
sign-extended immediate is also calculated during this cycle in case it is needed
in the next cycle.


3. Execution/effective address cycle (EX):


The ALU operates on the operands prepared in the prior cycle, performing
one of four functions depending on the MIPS instruction type.


• Memory reference:

ALUOutput ← A + Imm;

Operation: The ALU adds the operands to form the effective address and
places the result into the register ALUOutput.


• Register-Register ALU instruction:

ALUOutput ← A func B;

Operation: The ALU performs the operation specified by the function code
on the value in register A and on the value in register B. The result is placed
in the temporary register ALUOutput.


• Register-Immediate ALU instruction:

ALUOutput ← A op Imm;

Operation: The ALU performs the operation specified by the opcode on the
value in register A and on the value in register Imm. The result is placed in
the temporary register ALUOutput.


• Branch:

ALUOutput ← NPC + (Imm << 2);
Cond ← (A == 0)

Operation: The ALU adds the NPC to the sign-extended immediate value in
Imm, which is shifted left by 2 bits to create a word offset, to compute the
address of the branch target. Register A, which has been read in the prior
cycle, is checked to determine whether the branch is taken. Since we are
considering only one form of branch (BEQZ), the comparison is against 0.
Note that BEQZ is actually a pseudoinstruction that translates to a BEQ with
R0 as an operand. For simplicity, this is the only form of branch we consider.

The load-store architecture of MIPS means that effective address and
execution cycles can be combined into a single clock cycle, since no instruction
needs to simultaneously calculate a data address, calculate an instruction
target address, and perform an operation on the data. The other integer
instructions not included above are jumps of various forms, which are similar
to branches.


4. Memory access/branch completion cycle (MEM):


The PC is updated for all instructions: PC ← NPC;


• Memory reference:

LMD ← Mem[ALUOutput] or
Mem[ALUOutput] ← B;

Operation: Access memory if needed. If instruction is a load, data returns
from memory and is placed in the LMD (load memory data) register; if it is
a store, then the data from the B register is written into memory. In either
case the address used is the one computed during the prior cycle and stored
in the register ALUOutput.


• Branch:

if (cond) PC ← ALUOutput

Operation: If the instruction branches, the PC is replaced with the branch
destination address in the register ALUOutput.


5. Write-back cycle (WB):


• Register-Register ALU instruction:

Regs[rd] ← ALUOutput;


• Register-Immediate ALU instruction:

Regs[rt] ← ALUOutput;


• Load instruction:

Regs[rt] ← LMD;

Operation: Write the result into the register file, whether it comes from the
memory system (which is in LMD) or from the ALU (which is in ALUOutput);
the register destination field is also in one of two positions (rd or rt)
depending on the effective opcode.
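
The five steps above can be sketched as a small Python interpreter. This is an illustration under assumptions, not the hardware design: instructions are held as already-decoded records rather than 32-bit encodings, only five hypothetical mnemonics are modeled (`LW`, `SW`, `ADD`, `ADDI`, `BEQZ`), and the names `Instr`, `step`, and `run` are invented for the sketch. The cycle count assumes that stores and branches finish after MEM (4 cycles) while the other instructions need WB (5 cycles).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Instr:
    op: str        # "LW", "SW", "ADD", "ADDI", or "BEQZ"
    rs: int = 0    # first source register specifier
    rt: int = 0    # second source, or destination for loads and immediates
    rd: int = 0    # destination for register-register ALU operations
    imm: int = 0   # immediate field, assumed already sign-extended


def step(pc: int, regs: List[int], imem: Dict[int, Instr],
         dmem: Dict[int, int]) -> Tuple[int, int]:
    """Run one instruction through the five steps; return (new PC, cycles used)."""
    # 1. IF: fetch the instruction and compute the next sequential PC.
    ir = imem[pc]
    npc = pc + 4
    # 2. ID: read both registers and the immediate, whatever the instruction is.
    a, b, imm = regs[ir.rs], regs[ir.rt], ir.imm
    # 3. EX: one of four ALU functions, chosen by instruction type.
    cond = False
    if ir.op in ("LW", "SW", "ADDI"):
        alu_output = a + imm
    elif ir.op == "ADD":
        alu_output = a + b
    elif ir.op == "BEQZ":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported op {ir.op}")
    # 4. MEM: update the PC, access memory, complete a taken branch.
    pc = npc
    lmd = None
    if ir.op == "LW":
        lmd = dmem.get(alu_output, 0)
    elif ir.op == "SW":
        dmem[alu_output] = b
    elif ir.op == "BEQZ" and cond:
        pc = alu_output
    # 5. WB: write the register file (destination is rd or rt by type).
    if ir.op == "ADD":
        regs[ir.rd] = alu_output
    elif ir.op == "ADDI":
        regs[ir.rt] = alu_output
    elif ir.op == "LW":
        regs[ir.rt] = lmd
    regs[0] = 0  # R0 always reads as zero
    return pc, (4 if ir.op in ("SW", "BEQZ") else 5)


def run(imem: Dict[int, Instr], steps: int) -> Tuple[int, List[int], Dict[int, int], int]:
    """Execute `steps` instructions from PC 0; return (PC, registers, memory, cycles)."""
    pc, regs, dmem, total = 0, [0] * 32, {}, 0
    for _ in range(steps):
        pc, cycles = step(pc, regs, imem, dmem)
        total += cycles
    return pc, regs, dmem, total
```

Each local variable in `step` plays the role of one temporary register (IR, NPC, A, B, Imm, ALUOutput, Cond, LMD), while `regs`, `dmem`, and the returned PC correspond to the architecturally visible state.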
Figure A.17 shows how an instruction flows through the data path. At the end
of each clock cycle, every value computed during that clock cycle and required 
on a later clock cycle (whether for this instruction or the next) is written into a 
storage device, which may be memory, a general-purpose register, the PC, or a 
temporary register (i.e., LMD, Imm, A, B, IR, NPC, ALUOutput, or Cond). The 
temporary registers hold values between clock cycles for one instruction, while 
the other storage elements are visible parts of the state and hold values between 
successive instructions. 

Although all processors today are pipelined, this multicycle implementation 
is a reasonable approximation of how most processors would have been imple- 
mented in earlier times. A simple finite-state machine could be used to implement 
the control following the 5-cycle structure shown above. For a much more com- 
plex processor, microcode control could be used. In either event, an instruction 
sequence like that above would determine the structure of the control.

Figure A.17 The implementation of the MIPS data path allows every instruction to be executed in 4 or 5 clock
cycles. Although the PC is shown in the portion of the data path that is used in instruction fetch and the registers are
shown in the portion of the data path that is used in instruction decode/register fetch, both of these functional units
are read as well as written by an instruction. Although we show these functional units in the cycle corresponding to
where they are read, the PC is written during the memory access clock cycle and the registers are written during the
write-back clock cycle. In both cases, the writes in later pipe stages are indicated by the multiplexer output (in memory
access or write back), which carries a value back to the PC or registers. These backward-flowing signals introduce
much of the complexity of pipelining, since they indicate the possibility of hazards.

There are some hardware redundancies that could be eliminated in this multicycle
implementation. For example, there are two ALUs: one to increment the PC
and one used for effective address and ALU computation. Since they are not 
needed on the same clock cycle, we could merge them by adding additional mul- 
tiplexers and sharing the same ALU. Likewise, instructions and data could be 
stored in the same memory, since the data and instruction accesses happen on dif- 
ferent clock cycles. 

Rather than optimize this simple implementation, we will leave the design as 
it is in Figure A.17, since this provides us with a better base for the pipelined
implementation. 

As an alternative to the multicycle design discussed in this section, we could 
also have implemented the CPU so that every instruction takes 1 long clock 
cycle. In such cases, the temporary registers would be deleted, since there would 
not be any communication across clock cycles within an instruction. Every 
instruction would execute in 1 long clock cycle, writing the result into the data 
memory, registers, or PC at the end of the clock cycle. The CPI would be one for 
such a processor. The clock cycle, however, would be roughly equal to five times 
the clock cycle of the multicycle processor, since every instruction would need to 
traverse all the functional units. Designers would never use this single-cycle 
implementation for two reasons. First, a single-cycle implementation would be 
very inefficient for most CPUs that have a reasonable variation among the 
amount of work, and hence in the clock cycle time, needed for different instruc- 
tions. Second, a single-cycle implementation requires the duplication of func- 
tional units that could be shared in a multicycle implementation. Nonetheless, 
this single-cycle data path allows us to illustrate how pipelining can improve the 
clock cycle time, as opposed to the CPI, of a processor. 
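
A rough comparison of the three organizations can be made with CPU time per instruction = CPI x clock cycle time. The sketch below is only illustrative: the instruction mix (30% of instructions taking 4 cycles, 70% taking 5) is a hypothetical assumption, the single-cycle clock is taken as exactly five times the multicycle clock, and the pipelined figure ignores stalls and pipeline register overhead.

```python
def time_per_instruction(cpi: float, cycle_time: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time


T = 1.0  # multicycle clock cycle time, in arbitrary units

# Hypothetical mix: fraction of instructions needing 4 cycles vs. 5 cycles.
MIX = {4: 0.30, 5: 0.70}
multicycle_cpi = sum(cycles * fraction for cycles, fraction in MIX.items())

single_cycle = time_per_instruction(1, 5 * T)       # CPI of 1, clock 5x longer
multicycle = time_per_instruction(multicycle_cpi, T)
ideal_pipeline = time_per_instruction(1, T)          # CPI of 1 at the short clock
```

Under these assumptions the single-cycle and ideal pipelined designs have the same CPI and differ only in clock cycle time, which is the sense in which pipelining improves the clock cycle time rather than the CPI of the single-cycle data path.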


A Basic Pipeline for MIPS 


As before, we can pipeline the data path of Figure A.17 with almost no changes
by starting a new instruction on each clock cycle. Because every pipe stage is
active on every clock cycle, all operations in a pipe stage must complete in 1
clock cycle and any combination of operations must be able to occur at once.
Furthermore, pipelining the data path requires that values passed from one pipe
stage to the next must be placed in registers. Figure A.18 shows the MIPS pipeline
with the appropriate registers, called pipeline registers or pipeline latches,
between each pipeline stage. The registers are labeled with the names of the
stages they connect. Figure A.18 is drawn so that connections through the pipeline
registers from one stage to another are clear.

All of the registers needed to hold values temporarily between clock cycles
within one instruction are subsumed into these pipeline registers. The fields of
the instruction register (IR), which is part of the IF/ID register, are labeled when
they are used to supply register names. The pipeline registers carry both data and
control from one pipeline stage to the next. Any value needed on a later pipeline
stage must be placed in such a register and copied from one pipeline register to
the next, until it is no longer needed. If we tried to just use the temporary registers
we had in our earlier unpipelined data path, values could be overwritten before all
uses were completed. For example, the field of a register operand used for a write
on a load or ALU operation is supplied from the MEM/WB pipeline register
rather than from the IF/ID register. This is because we want a load or ALU operation
to write the register designated by that operation, not the register field of the
instruction currently transitioning from IF to ID! This destination register field is
simply copied from one pipeline register to the next, until it is needed during the
WB stage.

Figure A.18 The data path is pipelined by adding a set of registers, one between each pair of pipe stages. The
registers serve to convey values and control information from one stage to the next. We can also think of the PC as a
pipeline register, which sits before the IF stage of the pipeline, leading to one pipeline register for each pipe stage.
Recall that the PC is an edge-triggered register written at the end of the clock cycle; hence there is no race condition
in writing the PC. The selection multiplexer for the PC has been moved so that the PC is written in exactly one stage
(IF). If we didn't move it, there would be a conflict when a branch occurred, since two instructions would try to write
different values into the PC. Most of the data paths flow from left to right, which is from earlier in time to later. The
paths flowing from right to left (which carry the register write-back information and PC information on a branch)
introduce complications into our pipeline.

Any instruction is active in exactly one stage of the pipeline at a time; therefore,
any actions taken on behalf of an instruction occur between a pair of pipeline
registers. Thus, we can also look at the activities of the pipeline by examining
what has to happen on any pipeline stage depending on the instruction type.
Figure A.19 shows this view. Fields of the pipeline registers are named so as to show
the flow of data from one stage to the next. Notice that the actions in the first two
stages are independent of the current instruction type; they must be independent
because the instruction is not decoded until the end of the ID stage. The IF activity
depends on whether the instruction in EX/MEM is a taken branch. If so, then the
branch-target address of the branch instruction in EX/MEM is written into the PC
at the end of IF; otherwise the incremented PC will be written back. (As we said
earlier, this effect of branches leads to complications in the pipeline that we deal
with in the next few sections.) The fixed-position encoding of the register source
operands is critical to allowing the registers to be fetched during ID.

Stage   Any instruction

IF      IF/ID.IR ← Mem[PC];
        IF/ID.NPC,PC ← (if ((EX/MEM.opcode == branch) & EX/MEM.cond)
                        {EX/MEM.ALUOutput} else {PC+4});

ID      ID/EX.A ← Regs[IF/ID.IR[rs]]; ID/EX.B ← Regs[IF/ID.IR[rt]];
        ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
        ID/EX.Imm ← sign-extend(IF/ID.IR[immediate field]);

Stage   By instruction type

EX      ALU instruction:
            EX/MEM.IR ← ID/EX.IR;
            EX/MEM.ALUOutput ← ID/EX.A func ID/EX.B;
            or
            EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm;
        Load or store instruction:
            EX/MEM.IR ← ID/EX.IR;
            EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm;
            EX/MEM.B ← ID/EX.B;
        Branch instruction:
            EX/MEM.ALUOutput ← ID/EX.NPC + (ID/EX.Imm << 2);
            EX/MEM.cond ← (ID/EX.A == 0);

MEM     ALU instruction:
            MEM/WB.IR ← EX/MEM.IR;
            MEM/WB.ALUOutput ← EX/MEM.ALUOutput;
        Load or store instruction:
            MEM/WB.IR ← EX/MEM.IR;
            MEM/WB.LMD ← Mem[EX/MEM.ALUOutput];
            or
            Mem[EX/MEM.ALUOutput] ← EX/MEM.B;

WB      ALU instruction:
            Regs[MEM/WB.IR[rd]] ← MEM/WB.ALUOutput;
            or
            Regs[MEM/WB.IR[rt]] ← MEM/WB.ALUOutput;
        Load or store instruction (for load only):
            Regs[MEM/WB.IR[rt]] ← MEM/WB.LMD;

Figure A.19 Events on every pipe stage of the MIPS pipeline. Let's review the actions in the stages that are specific
to the pipeline organization. In IF, in addition to fetching the instruction and computing the new PC, we store the
incremented PC both into the PC and into a pipeline register (NPC) for later use in computing the branch-target
address. This structure is the same as the organization in Figure A.18, where the PC is updated in IF from one of two
sources. In ID, we fetch the registers, extend the sign of the lower 16 bits of the IR (the immediate field), and pass
along the IR and NPC. During EX, we perform an ALU operation or an address calculation; we pass along the IR and
the B register (if the instruction is a store). We also set the value of cond to 1 if the instruction is a taken branch.
During the MEM phase, we cycle the memory, write the PC if needed, and pass along values needed in the final pipe
stage. Finally, during WB, we update the register field from either the ALU output or the loaded value. For simplicity
we always pass the entire IR from one stage to the next, although as an instruction proceeds down the pipeline, less
and less of the IR is needed.
To control this simple pipeline we need only determine how to set the control
for the four multiplexers in the data path of Figure A.18. The two multiplexers in
the ALU stage are set depending on the instruction type, which is dictated by the 
IR field of the ID/EX register. The top ALU input multiplexer is set by whether 
the instruction is a branch or not, and the bottom multiplexer is set by whether the 
instruction is a register-register ALU operation or any other type of operation. 
The multiplexer in the IF stage chooses whether to use the value of the incre- 
mented PC or the value of the EX/MEM.ALUOutput (the branch target) to write 
into the PC. This multiplexer is controlled by the field EX/MEM.cond. The 
fourth multiplexer is controlled by whether the instruction in the WB stage is a 
load or an ALU operation. In addition to these four multiplexers, there is one 
additional multiplexer needed that is not drawn in Figure A.18, but whose existence
is clear from looking at the WB stage of an ALU operation. The destination
register field is in one of two different places depending on the instruction type 
(register-register ALU versus either ALU immediate or load). Thus, we will need 
a multiplexer to choose the correct portion of the IR in the MEM/WB register to 
specify the register destination field, assuming the instruction writes a register. 


Implementing the Control for the MIPS Pipeline 


The process of letting an instruction move from the instruction decode stage (ID) 
into the execution stage (EX) of this pipeline is usually called instruction issue; 
an instruction that has made this step is said to have issued. For the MIPS integer 
pipeline, all the data hazards can be checked during the ID phase of the pipeline. 
If a data hazard exists, the instruction is stalled before it is issued. Likewise, we 
can determine what forwarding will be needed during ID and set the appropriate 
controls then. Detecting interlocks early in the pipeline reduces the hardware 
complexity because the hardware never has to suspend an instruction that has 
updated the state of the processor, unless the entire processor is stalled. Alterna- 
tively, we can detect the hazard or forwarding at the beginning of a clock cycle 
that uses an operand (EX and MEM for this pipeline). To show the differences in 
these two approaches, we will show how the interlock for a RAW hazard with the 
source coming from a load instruction (called a load interlock) can be imple- 
mented by a check in ID, while the implementation of forwarding paths to the 
ALU inputs can be done during EX. Figure A.20 lists the variety of circum- 
stances that we must handle. 

Let's start with implementing the load interlock. If there is a RAW hazard 
with the source instruction being a load, the load instruction will be in the EX 
stage when an instruction that needs the load data will be in the ID stage. Thus, 
we can describe all the possible hazard situations with a small table, which can be 
directly translated to an implementation. Figure A.21 shows a table that detects 
all load interlocks when the instruction using the load result is in the ID stage. 

Once a hazard has been detected, the control unit must insert the pipeline stall 
and prevent the instructions in the IF and ID stages from advancing. As we said 
earlier, all the control information is carried in the pipeline registers. (Carrying 

Situation            Example code sequence    Action

No dependence        LD    R1,45(R2)          No hazard possible because no dependence
                     DADD  R5,R6,R7           exists on R1 in the immediately following
                     DSUB  R8,R6,R7           three instructions.
                     OR    R9,R6,R7

Dependence           LD    R1,45(R2)          Comparators detect the use of R1 in the DADD
requiring stall      DADD  R5,R1,R7           and stall the DADD (and DSUB and OR) before
                     DSUB  R8,R6,R7           the DADD begins EX.
                     OR    R9,R6,R7

Dependence           LD    R1,45(R2)          Comparators detect use of R1 in DSUB and
overcome by          DADD  R5,R6,R7           forward result of load to ALU in time for DSUB
forwarding           DSUB  R8,R1,R7           to begin EX.
                     OR    R9,R6,R7

Dependence with      LD    R1,45(R2)          No action required because the read of R1 by
accesses in order    DADD  R5,R6,R7           OR occurs in the second half of the ID phase,
                     DSUB  R8,R6,R7           while the write of the loaded data occurred in
                     OR    R9,R1,R7           the first half.

Figure A.20 Situations that the pipeline hazard detection hardware can see by comparing
the destination and sources of adjacent instructions. This table indicates that
the only comparison needed is between the destination and the sources on the two
instructions following the instruction that wrote the destination. In the case of a stall,
the pipeline dependences will look like the third case once execution continues. Of
course hazards that involve R0 can be ignored since the register always contains 0, and
the test above could be extended to do this.














Opcode field of ID/EX    Opcode field of IF/ID
(ID/EX.IR0..5)           (IF/ID.IR0..5)                Matching operand fields

Load                     Register-register ALU         ID/EX.IR[rt] == IF/ID.IR[rs]
Load                     Register-register ALU         ID/EX.IR[rt] == IF/ID.IR[rt]
Load                     Load, store, ALU immediate,   ID/EX.IR[rt] == IF/ID.IR[rs]
                         or branch

Figure A.21 The logic to detect the need for load interlocks during the ID stage of
an instruction requires three comparisons. Lines 1 and 2 of the table test whether the
load destination register is one of the source registers for a register-register operation
in ID. Line 3 of the table determines if the load destination register is a source for a load
or store effective address, an ALU immediate, or a branch test. Remember that the IF/ID
register holds the state of the instruction in ID, which potentially uses the load result,
while ID/EX holds the state of the instruction in EX, which is the load instruction.


the instruction along is enough, since all control is derived from it.) Thus, when 
we detect a hazard we need only change the control portion of the ID/EX pipeline 
register to all 0s, which happens to be a no-op (an instruction that does nothing,
such as DADD R0,R0,R0). In addition, we simply recirculate the contents of the
IF/ID registers to hold the stalled instruction. In a pipeline with more complex 
hazards, the same ideas would apply: We can detect the hazard by comparing 
some set of pipeline registers and shift in no-ops to prevent erroneous execution. 
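
The load interlock check of Figure A.21 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Instr record, the instruction-class names, and the helper functions are hypothetical and not part of the MIPS design, and the R0 test is the optional extension mentioned in the caption of Figure A.20.

```python
from dataclasses import dataclass

# Hypothetical instruction classes, standing in for the opcode field tests.
LOAD, STORE, ALU_RR, ALU_IMM, BRANCH, NOP = (
    "load", "store", "alu_rr", "alu_imm", "branch", "nop")


@dataclass
class Instr:
    kind: str      # one of the classes above
    rs: int = 0    # first source register field
    rt: int = 0    # second source field, or destination for loads/immediates
    rd: int = 0    # destination field for register-register ALU operations


def load_interlock(id_ex: Instr, if_id: Instr) -> bool:
    """Return True if the instruction in ID must stall behind a load in EX."""
    if id_ex.kind != LOAD:
        return False
    dest = id_ex.rt                    # a load writes the register named by rt
    if dest == 0:                      # assumed extension: R0 is never a hazard
        return False
    if if_id.kind == ALU_RR:           # lines 1 and 2 of the table
        return dest in (if_id.rs, if_id.rt)
    if if_id.kind in (LOAD, STORE, ALU_IMM, BRANCH):   # line 3 of the table
        return dest == if_id.rs
    return False


def stall(if_id: Instr):
    """Insert the bubble: ID/EX receives a no-op and IF/ID is recirculated."""
    return Instr(NOP), if_id
```

For the second case of Figure A.20, load_interlock(Instr(LOAD, rs=2, rt=1), Instr(ALU_RR, rs=1, rt=7, rd=5)) reports a hazard, so the DADD is held in ID for one cycle.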

Implementing the forwarding logic is similar, although there are more cases 
to consider. The key observation needed to implement the forwarding logic is that 
the pipeline registers contain both the data to be forwarded as well as the source 
and destination register fields. All forwarding logically happens from the ALU or 
data memory output to the ALU input, the data memory input, or the zero detec- 
tion unit. Thus, we can implement the forwarding by a comparison of the destina- 
tion registers of the IR contained in the EX/MEM and MEM/WB stages against 
the source registers of the IR contained in the ID/EX and EX/MEM registers. 
Figure A.22 shows the comparisons and possible forwarding operations where 
the destination of the forwarded result is an ALU input for the instruction cur- 
rently in EX. 
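
As a rough illustration, the comparisons of Figure A.22 can be written as a small Python function. The Instr record and the function names are assumptions made for this sketch. It also includes the priority rule from the caption of Figure A.22 (prefer EX/MEM over MEM/WB when both match) and, as an assumed extension, never forwards a value for R0.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    kind: str      # "alu_rr", "alu_imm", "load", "store", "branch", or "nop"
    rs: int = 0
    rt: int = 0
    rd: int = 0


def dest_reg(instr: Instr) -> Optional[int]:
    """Register written by an instruction, or None if it writes no register."""
    if instr.kind == "alu_rr":
        return instr.rd
    if instr.kind in ("alu_imm", "load"):
        return instr.rt
    return None


def forward_sources(id_ex: Instr, ex_mem: Instr,
                    mem_wb: Instr) -> Tuple[Optional[str], Optional[str]]:
    """Pick the forwarding source for the (top, bottom) ALU inputs.

    Each element is "EX/MEM", "MEM/WB", or None (use the register file value).
    """
    def select(src: int) -> Optional[str]:
        if src == 0:                       # assumed extension: R0 is constant
            return None
        # Only ALU results are available in EX/MEM; a load result is not.
        if ex_mem.kind in ("alu_rr", "alu_imm") and dest_reg(ex_mem) == src:
            return "EX/MEM"                # the most recent producer wins
        if dest_reg(mem_wb) == src:        # ALU output or loaded value (LMD)
            return "MEM/WB"
        return None

    uses_rs = id_ex.kind in ("alu_rr", "alu_imm", "load", "store", "branch")
    top = select(id_ex.rs) if uses_rs else None
    bottom = select(id_ex.rt) if id_ex.kind == "alu_rr" else None
    return top, bottom
```

With DADD R1,R2,R3 in MEM/WB, DADDI R1,R1,#2 in EX/MEM, and DSUB R4,R3,R1 in ID/EX, the function returns (None, "EX/MEM"), so the DSUB receives the DADDI result.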

In addition to the comparators and combinational logic that we need to deter- 
mine when a forwarding path needs to be enabled, we also need to enlarge the 
multiplexers at the ALU inputs and add the connections from the pipeline regis- 
ters that are used to forward the results. Figure A.23 shows the relevant segments 
of the pipelined data path with the additional multiplexers and connections in 
place. 

For MIPS, the hazard detection and forwarding hardware is reasonably sim- 
ple; we will see that things become somewhat more complicated when we 
extend this pipeline to deal with floating point. Before we do that, we need to 
handle branches. 


Dealing with Branches in the Pipeline 


In MIPS, the branches (BEQ and BNE) require testing a register for equality to 
another register, which may be R0. If we consider only the cases of BEQZ and
BNEZ which require a zero test, it is possible to complete this decision by the end 
of the ID cycle by moving the zero test into that cycle. To take advantage of an 
early decision on whether the branch is taken, both PCs (taken and untaken) must 
be computed early. Computing the branch-target address during ID requires an 
additional adder because the main ALU, which has been used for this function so 
far, is not usable until EX. Figure A.24 shows the revised pipelined data path. 
With the separate adder and a branch decision made during ID, there is only a 1- 
clock-cycle stall on branches. Although this reduces the branch delay to 1 cycle, 
it means that an ALU instruction followed by a branch on the result of the 
instruction will incur a data hazard stall. Figure A.25 shows the branch portion of 
the revised pipeline table from Figure A.19.
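
The PC selection performed once the branch is resolved in ID can be sketched as follows. This is a simplified model, not the hardware: the function names and parameters are hypothetical, and only the zero and nonzero tests discussed above are modeled. The target is the incremented PC plus the sign-extended 16-bit immediate shifted left by 2 bits.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, is_branch: bool, reg_value: int, imm16: int,
            branch_on_zero: bool = True) -> int:
    """Choose between the sequential PC and the branch target.

    pc is the address of the instruction being decoded; branch_on_zero selects
    the branch-if-zero test (True) or the branch-if-not-zero test (False).
    """
    npc = pc + 4                                  # incremented PC from IF
    target = npc + (sign_extend16(imm16) << 2)    # separate adder in ID
    is_zero = reg_value == 0                      # zero test moved into ID
    taken = is_branch and (is_zero if branch_on_zero else not is_zero)
    return target if taken else npc
```

For example, next_pc(0x100, True, 0, 3) yields 0x110 (taken, three instructions past the incremented PC), while the same call with a nonzero register value yields 0x104.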

In some processors, branch hazards are even more expensive in clock cycles 
than in our example, since the time to evaluate the branch condition and compute 
the destination can be even longer. For example, a processor with separate decode 
and register fetch stages will probably have a branch delay—the length of the 
control hazard—that is at least 1 clock cycle longer. The branch delay, unless it is 

Pipeline register   Opcode of               Pipeline register   Opcode of destination                       Destination of     Comparison
containing source   source                  containing desti-   instruction                                 the forwarded      (if equal then forward)
instruction         instruction             nation instruction                                              result

EX/MEM              Register-register ALU   ID/EX               Register-register ALU, ALU immediate,       Top ALU input      EX/MEM.IR[rd] == ID/EX.IR[rs]
                                                                load, store, branch
EX/MEM              Register-register ALU   ID/EX               Register-register ALU                       Bottom ALU input   EX/MEM.IR[rd] == ID/EX.IR[rt]
MEM/WB              Register-register ALU   ID/EX               Register-register ALU, ALU immediate,       Top ALU input      MEM/WB.IR[rd] == ID/EX.IR[rs]
                                                                load, store, branch
MEM/WB              Register-register ALU   ID/EX               Register-register ALU                       Bottom ALU input   MEM/WB.IR[rd] == ID/EX.IR[rt]
EX/MEM              ALU immediate           ID/EX               Register-register ALU, ALU immediate,       Top ALU input      EX/MEM.IR[rt] == ID/EX.IR[rs]
                                                                load, store, branch
EX/MEM              ALU immediate           ID/EX               Register-register ALU                       Bottom ALU input   EX/MEM.IR[rt] == ID/EX.IR[rt]
MEM/WB              ALU immediate           ID/EX               Register-register ALU, ALU immediate,       Top ALU input      MEM/WB.IR[rt] == ID/EX.IR[rs]
                                                                load, store, branch
MEM/WB              ALU immediate           ID/EX               Register-register ALU                       Bottom ALU input   MEM/WB.IR[rt] == ID/EX.IR[rt]
MEM/WB              Load                    ID/EX               Register-register ALU, ALU immediate,       Top ALU input      MEM/WB.IR[rt] == ID/EX.IR[rs]
                                                                load, store, branch
MEM/WB              Load                    ID/EX               Register-register ALU                       Bottom ALU input   MEM/WB.IR[rt] == ID/EX.IR[rt]

Figure A.22 Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from the ALU result
(in EX/MEM or in MEM/WB) or from the load result in MEM/WB. There are 10 separate comparisons needed to tell
whether a forwarding operation should occur. The top and bottom ALU inputs refer to the inputs corresponding to
the first and second ALU source operands, respectively, and are shown explicitly in Figure A.17 on page A-29 and in
Figure A.23 on page A-37. Remember that the pipeline latch for destination instruction in EX is ID/EX, while the
source values come from the ALUOutput portion of EX/MEM or MEM/WB or the LMD portion of MEM/WB. There is
one complication not addressed by this logic: dealing with multiple instructions that write the same register. For
example, during the code sequence DADD R1,R2,R3; DADDI R1,R1,#2; DSUB R4,R3,R1, the logic must ensure
that the DSUB instruction uses the result of the DADDI instruction rather than the result of the DADD instruction. The
logic shown above can be extended to handle this case by simply testing that forwarding from MEM/WB is enabled
only when forwarding from EX/MEM is not enabled for the same input. Because the DADDI result will be in EX/MEM, it
will be forwarded, rather than the DADD result in MEM/WB.


dealt with, turns into a branch penalty. Many older CPUs that implement more 
complex instruction sets have branch delays of 4 clock cycles or more, and large, 
deeply pipelined processors often have branch penalties of 6 or 7. In general, the 

[Data path diagram not reproduced; pipeline registers shown: ID/EX, EX/MEM, MEM/WB]

Figure A.23 Forwarding of results to the ALU requires the addition of three extra
inputs on each ALU multiplexer and the addition of three paths to the new inputs.
The paths correspond to a bypass of (1) the ALU output at the end of the EX, (2) the ALU
output at the end of the MEM stage, and (3) the memory output at the end of the MEM
stage.


deeper the pipeline, the worse the branch penalty in clock cycles. Of course, the 
relative performance effect of a longer branch penalty depends on the overall CPI 
of the processor. A high-CPI processor can afford to have more expensive
branches because the percentage of the processor's performance that will be lost
from branches is less.
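
How the relative cost of a branch penalty depends on the base CPI can be checked with a little arithmetic. The sketch below assumes the simple model CPI = base CPI + branch frequency x branch penalty; the function name and the sample numbers are illustrative assumptions, not measurements.

```python
def fraction_lost_to_branches(base_cpi: float, branch_freq: float,
                              penalty_cycles: float) -> float:
    """Fraction of execution cycles spent on branch stalls.

    Assumes total CPI = base_cpi + branch_freq * penalty_cycles.
    """
    stall_cpi = branch_freq * penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


# Illustrative numbers only: 20% branches, 3-cycle penalty.
for base in (1.0, 2.0):
    lost = fraction_lost_to_branches(base, 0.20, 3)
    print(f"base CPI {base}: {lost:.1%} of cycles lost to branch stalls")
```

Under these assumed numbers the fraction of cycles lost falls from about 38% at a base CPI of 1 to about 23% at a base CPI of 2, because the same stall cycles are spread over a larger total.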


What Makes Pipelining Hard to Implement? 


Now that we understand how to detect and resolve hazards, we can deal with 
some complications that we have avoided so far. The first part of this section 
considers the challenges of exceptional situations where the instruction execution 
order is changed in unexpected ways. In the second part of this section, we dis- 
cuss some of the challenges raised by different instruction sets. 


Dealing with Exceptions 


Exceptional situations are harder to handle in a pipelined CPU because the over- 
lapping of instructions makes it more difficult to know whether an instruction can 

Figure A.24 The stall from branch hazards can be reduced by moving the zero test and branch-target calculation
into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes 1
cycle from the 3-cycle stall for branches. The first change is to move both the branch-target address calculation and
the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase,
using either the branch-target address computed during ID or the incremented PC computed during IF. In comparison,
Figure A.18 obtained the branch-target address from the EX/MEM register and wrote the result during the MEM
clock cycle. As mentioned in Figure A.18, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which
is written with the address of the next instruction at the end of each IF cycle.


safely change the state of the CPU. In a pipelined CPU, an instruction is executed 
piece by piece and is not completed for several clock cycles. Unfortunately, other 
instructions in the pipeline can raise exceptions that may force the CPU to abort 
the instructions in the pipeline before they complete. Before we discuss these 
problems and their solutions in detail, we need to understand what types of situa- 
tions can arise and what architectural requirements exist for supporting them. 


Types of Exceptions and Requirements 


The terminology used to describe exceptional situations where the normal execu- 
tion order of instruction is changed varies among CPUs. The terms interrupt, 
fault, and exception are used, although not in a consistent fashion. We use the 
term exception to cover all these mechanisms, including the following: 


• I/O device request
• Invoking an operating system service from a user program

Pipe stage   Branch instruction

IF           IF/ID.IR ← Mem[PC];
             IF/ID.NPC,PC ← (if ((IF/ID.opcode == branch) & (Regs[IF/ID.IR6..10] op 0))
                             {IF/ID.NPC + sign-extended (IF/ID.IR[immediate field] << 2)}
                             else {PC+4});

ID           ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
             ID/EX.IR ← IF/ID.IR;
             ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31

EX
MEM
WB

Figure A.25 This revised pipeline structure is based on the original in Figure A.19. It uses a separate adder, as in
Figure A.24, to compute the branch-target address during ID. The operations that are new or have changed are in
bold. Because the branch-target address addition happens during ID, it will happen for all instructions; the branch
condition (Regs[IF/ID.IR6..10] op 0) will also be done for all instructions. The selection of the sequential PC or the
branch-target PC still occurs during IF, but it now uses values from the ID stage, which correspond to the values set
by the previous instruction. This change reduces the branch penalty by 2 cycles: one from evaluating the branch target
and condition earlier and one from controlling the PC selection on the same clock rather than on the next clock.
Since the value of cond is set to 0, unless the instruction in ID is a taken branch, the processor must decode the
instruction before the end of ID. Because the branch is done by the end of ID, the EX, MEM, and WB stages are unused
for branches. An additional complication arises for jumps that have a longer offset than branches. We can resolve this
by using an additional adder that sums the PC and lower 26 bits of the IR after shifting left by 2 bits.


• Tracing instruction execution
• Breakpoint (programmer-requested interrupt)
• Integer arithmetic overflow
• FP arithmetic anomaly
• Page fault (not in main memory)
• Misaligned memory accesses (if alignment is required)
• Memory protection violation
• Using an undefined or unimplemented instruction
• Hardware malfunctions
• Power failure
When we wish to refer to some particular class of such exceptions, we will use 
a longer name, such as I/O interrupt, floating-point exception, or page fault. 
Figure A.26 shows the variety of different names for the common exception 
events above. 
Although we use the term exception to cover all of these events, individual
events have important characteristics that determine what action is needed in the
hardware. The requirements on exceptions can be characterized on five semi-
independent axes:

Exception event          IBM 360                  VAX                        Motorola 680x0               Intel 80x86

I/O device request       Input/output             Device interrupt           Exception (level 0...7       Vectored interrupt
                         interruption                                        autovector)

Invoking the operating   Supervisor call          Exception (change          Exception (unimplemented     Interrupt
system service from a    interruption             mode supervisor trap)      instruction), on Macintosh   (INT instruction)
user program

Tracing instruction      Not applicable           Exception (trace fault)    Exception (trace)            Interrupt (single-
execution                                                                                                 step trap)

Breakpoint               Not applicable           Exception                  Exception (illegal           Interrupt
                                                  (breakpoint fault)         instruction or               (breakpoint trap)
                                                                             breakpoint)

Integer arithmetic       Program interruption     Exception (integer         Exception                    Interrupt (overflow
overflow or underflow;   (overflow or             overflow trap or           (floating-point              trap or math unit
FP trap                  underflow exception)     floating underflow         coprocessor errors)          exception)
                                                  fault)

Page fault               Not applicable           Exception (translation     Exception (memory-           Interrupt
(not in main memory)     (only in 370)            not valid fault)           management unit              (page fault)
                                                                             errors)

Misaligned memory        Program interruption     Not applicable             Exception                    Not applicable
accesses                 (specification                                      (address error)
                         exception)

Memory protection        Program interruption     Exception (access          Exception                    Interrupt
violations               (protection exception)   control violation          (bus error)                  (protection
                                                  fault)                                                  exception)

Using undefined          Program interruption     Exception (opcode          Exception (illegal           Interrupt (invalid
instructions             (operation exception)    privileged/reserved        instruction or break-        opcode)
                                                  fault)                     point/unimplemented
                                                                             instruction)

Hardware                 Machine-check            Exception (machine-        Exception                    Not applicable
malfunctions             interruption             check abort)               (bus error)

Power failure            Machine-check            Urgent interrupt           Not applicable               Nonmaskable
                         interruption                                                                     interrupt

Figure A.26 The names of common exceptions vary across four different architectures. Every event on the IBM
360 and 80x86 is called an interrupt, while every event on the 680x0 is called an exception. VAX divides events into
interrupts or exceptions. Adjectives device, software, and urgent are used with VAX interrupts, while VAX exceptions are
subdivided into faults, traps, and aborts.


1. Synchronous versus asynchronous—If the event occurs at the same place 
every time the program is executed with the same data and memory alloca- 
tion, the event is synchronous. With the exception of hardware malfunctions, 
asynchronous events are caused by devices external to the CPU and memory.
Asynchronous events usually can be handled after the completion of the 
current instruction, which makes them easier to handle. 


2. User requested versus coerced—If the user task directly asks for it, it is a 
user-requested event. In some sense, user-requested exceptions are not really 
exceptions, since they are predictable. They are treated as exceptions, however,
because the same mechanisms that are used to save and restore the state
are used for these user-requested events. Because the only function of an 
instruction that triggers this exception is to cause the exception, user- 
requested exceptions can always be handled after the instruction has com- 
pleted. Coerced exceptions are caused by some hardware event that is not 
under the control of the user program. Coerced exceptions are harder to 
implement because they are not predictable. 


3. User maskable versus user nonmaskable—If an event can be masked or dis- 
abled by a user task, it is user maskable. This mask simply controls whether 
the hardware responds to the exception or not. 


4. Within versus between instructions—This classification depends on whether 
the event prevents instruction completion by occurring in the middle of exe- 
cution—no matter how short—or whether it is recognized between instruc- 
tions. Exceptions that occur within instructions are usually synchronous, 
since the instruction triggers the exception. It's harder to implement excep- 
tions that occur within instructions than those between instructions, since the 
instruction must be stopped and restarted. Asynchronous exceptions that 
occur within instructions arise from catastrophic situations (e.g., hardware 
malfunction) and always cause program termination. 


5. Resume versus terminate—If the program's execution always stops after the 
interrupt, it is a terminating event. If the program's execution continues after 
the interrupt, it is a resuming event. It is easier to implement exceptions that 
terminate execution, since the CPU need not be able to restart execution of 
the same program after handling the exception. 


Figure A.27 classifies the examples from Figure A.26 according to these five 
categories. The difficult task is implementing interrupts occurring within instruc- 
tions where the instruction must be resumed. Implementing such exceptions 
requires that another program must be invoked to save the state of the executing 
program, correct the cause of the exception, and then restore the state of the pro- 
gram before the instruction that caused the exception can be tried again. This pro- 
cess must be effectively invisible to the executing program. If a pipeline provides 
the ability for the processor to handle the exception, save the state, and restart 
without affecting the execution of the program, the pipeline or processor is said 
to be restartable. While early supercomputers and microprocessors often lacked 
this property, almost all processors today support it, at least for the integer pipe- 
line, because it is needed to implement virtual memory (see Chapter 5). 
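
The five axes lend themselves to a small record type. The sketch below encodes a handful of rows from Figure A.27 and picks out the cases that require a restartable pipeline; the type, field, and function names are assumptions made for illustration.

```python
from typing import NamedTuple


class ExceptionClass(NamedTuple):
    synchronous: bool         # False means asynchronous
    user_requested: bool      # False means coerced
    user_maskable: bool       # False means nonmaskable
    within_instruction: bool  # False means recognized between instructions
    resumable: bool           # False means the program terminates


# A few rows transcribed from Figure A.27.
CLASSES = {
    "I/O device request":          ExceptionClass(False, False, False, False, True),
    "Breakpoint":                  ExceptionClass(True, True, True, False, True),
    "Integer arithmetic overflow": ExceptionClass(True, False, True, True, True),
    "Page fault":                  ExceptionClass(True, False, False, True, True),
    "Power failure":               ExceptionClass(False, False, False, True, False),
}


def needs_restartable_pipeline(c: ExceptionClass) -> bool:
    """The hard case: the event interrupts an instruction that must be resumed."""
    return c.within_instruction and c.resumable


hard_cases = sorted(name for name, c in CLASSES.items()
                    if needs_restartable_pipeline(c))
print(hard_cases)
```

Of the rows encoded here, only the page fault and the integer overflow fall into the difficult category; the I/O request is handled between instructions, and a power failure never resumes.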


Stopping and Restarting Execution 


As in unpipelined implementations, the most difficult exceptions have two prop- 
erties: (1) they occur within instructions (that is, in the middle of the instruction 
execution corresponding to EX or MEM pipe stages), and (2) they must be 
restartable. In our MIPS pipeline, for example, a virtual memory page fault 
resulting from a data fetch cannot occur until sometime in the MEM stage of the 

                                                                  User               Within vs.
                                Synchronous vs.   User request    maskable vs.       between        Resume vs.
Exception type                  asynchronous      vs. coerced     nonmaskable        instructions   terminate

I/O device request              Asynchronous      Coerced         Nonmaskable        Between        Resume
Invoke operating system         Synchronous       User request    Nonmaskable        Between        Resume
Tracing instruction execution   Synchronous       User request    User maskable      Between        Resume
Breakpoint                      Synchronous       User request    User maskable      Between        Resume
Integer arithmetic overflow     Synchronous       Coerced         User maskable      Within         Resume
Floating-point arithmetic       Synchronous       Coerced         User maskable      Within         Resume
overflow or underflow
Page fault                      Synchronous       Coerced         Nonmaskable        Within         Resume
Misaligned memory accesses      Synchronous       Coerced         User maskable      Within         Resume
Memory protection violations    Synchronous       Coerced         Nonmaskable        Within         Resume
Using undefined instructions    Synchronous       Coerced         Nonmaskable        Within         Terminate
Hardware malfunctions           Asynchronous      Coerced         Nonmaskable        Within         Terminate
Power failure                   Asynchronous      Coerced         Nonmaskable        Within         Terminate

Figure A.27 Five categories are used to define what actions are needed for the different exception types shown
in Figure A.26. Exceptions that must allow resumption are marked as resume, although the software may often
choose to terminate the program. Synchronous, coerced exceptions occurring within instructions that can be
resumed are the most difficult to implement. We might expect that memory protection access violations would
always result in termination; however, modern operating systems use memory protection to detect events such as
the first attempt to use a page or the first write to a page. Thus, CPUs should be able to resume after such exceptions.


instruction. By the time that fault is seen, several other instructions will be in exe- 
cution. A page fault must be restartable and requires the intervention of another 
process, such as the operating system. Thus, the pipeline must be safely shut 
down and the state saved so that the instruction can be restarted in the correct 
state. Restarting is usually implemented by saving the PC of the instruction at 
which to restart. If the restarted instruction is not a branch, then we will continue 
to fetch the sequential successors and begin their execution in the normal fashion. 
If the restarted instruction is a branch, then we will reevaluate the branch condi- 
tion and begin fetching from either the target or the fall-through. When an excep- 
tion occurs, the pipeline control can take the following steps to save the pipeline 
state safely: 


1. Force a trap instruction into the pipeline on the next IF.

2. Until the trap is taken, turn off all writes for the faulting instruction and for all
instructions that follow in the pipeline; this can be done by placing zeros into
the pipeline latches of all instructions in the pipeline, starting with the
instruction that generates the exception, but not those that precede that
instruction. This prevents any state changes for instructions that will not be
completed before the exception is handled.

3. After the exception-handling routine in the operating system receives control,
it immediately saves the PC of the faulting instruction. This value will be
used to return from the exception later.
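
The three steps above can be mimicked with a toy model of the pipeline latches. This is a sketch only: the Latch record, the "TRAP" marker, and the oldest-first list layout are assumptions made for illustration, and delayed branches are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Latch:
    pc: int                     # address of the instruction held in this latch
    name: str                   # mnemonic, for readability only
    write_enable: bool = True   # cleared to suppress register/memory writes


def take_exception(pipeline: List[Latch], faulting_index: int) -> Tuple[str, int]:
    """Squash the faulting instruction and everything younger than it.

    pipeline is ordered oldest first, so entries before faulting_index are
    earlier instructions that are allowed to complete.
    """
    next_fetch = "TRAP"                        # step 1: force a trap on next IF
    for latch in pipeline[faulting_index:]:    # step 2: turn off all writes
        latch.write_enable = False
    saved_pc = pipeline[faulting_index].pc     # step 3: PC saved by the handler
    return next_fetch, saved_pc
```

If the second of four in-flight instructions faults, the first keeps its write enable, the other three are squashed, and the PC of the second is the restart address.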


When we use delayed branches, as mentioned in the last section, it is no 
longer possible to re-create the state of the processor with a single PC because 
the instructions in the pipeline may not be sequentially related. So we need to 
save and restore as many PCs as the length of the branch delay plus one. This is 
done in the third step above. 

After the exception has been handled, special instructions return the proces- 
sor from the exception by reloading the PCs and restarting the instruction stream 
(using the instruction RFE in MIPS). If the pipeline can be stopped so that the 
instructions just before the faulting instruction are completed and those after it 
can be restarted from scratch, the pipeline is said to have precise exceptions. Ide- 
ally, the faulting instruction would not have changed the state, and correctly han- 
dling some exceptions requires that the faulting instruction have no effects. For 
other exceptions, such as floating-point exceptions, the faulting instruction on 
some processors writes its result before the exception can be handled. In such 
cases, the hardware must be prepared to retrieve the source operands, even if the 
destination is identical to one of the source operands. Because floating-point 
operations may run for many cycles, it is highly likely that some other instruction 
may have written the source operands (as we will see in the next section, floating- 
point operations often complete out of order). To overcome this, many recent 
high-performance CPUs have introduced two modes of operation. One mode has 
precise exceptions and the other (fast or performance mode) does not. Of course, 
the precise exception mode is slower, since it allows less overlap among floating- 
point instructions. In some high-performance CPUs, including Alpha 21064, 
Power2, and MIPS R8000, the precise mode is often much slower (> 10 times) 
and thus useful only for debugging of codes. 

Supporting precise exceptions is a requirement in many systems, while in 
others it is "just" valuable because it simplifies the operating system interface. At 
a minimum, any processor with demand paging or IEEE arithmetic trap handlers 
must make its exceptions precise, either in the hardware or with some software 
support. For integer pipelines, the task of creating precise exceptions is easier, 
and accommodating virtual memory strongly motivates the support of precise 
exceptions for memory references. In practice, these reasons have led designers 
and architects to always provide precise exceptions for the integer pipeline. In 
this section we describe how to implement precise exceptions for the MIPS inte- 
ger pipeline. We will describe techniques for handling the more complex chal- 
lenges arising in the FP pipeline in Section A.5. 


Exceptions in MIPS 


Figure A.28 shows the MIPS pipeline stages and which "problem" exceptions 
might occur in each stage. With pipelining, multiple exceptions may occur in the

Pipeline stage   Problem exceptions occurring

IF               Page fault on instruction fetch; misaligned memory access; memory
                 protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory
                 protection violation
WB               None

Figure A.28 Exceptions that may occur in the MIPS pipeline. Exceptions raised from
instruction or data memory access account for six out of eight cases.


same clock cycle because there are multiple instructions in execution. For exam- 
ple, consider this instruction sequence: 





LD      IF   ID   EX   MEM  WB
DADD         IF   ID   EX   MEM  WB





This pair of instructions can cause a data page fault and an arithmetic exception 
at the same time, since the LD is in the MEM stage while the DADD is in the EX 
stage. This case can be handled by dealing with only the data page fault and then 
restarting the execution. The second exception will reoccur (but not the first, if 
the software is correct), and when the second exception occurs, it can be handled 
independently. 

In reality, the situation is not as straightforward as this simple example. 
Exceptions may occur out of order; that is, an instruction may cause an exception 
before an earlier instruction causes one. Consider again the above sequence of 
instructions, LD followed by DADD. The LD can get a data page fault, seen when
the instruction is in MEM, and the DADD can get an instruction page fault, seen
when the DADD instruction is in IF. The instruction page fault will actually occur 
first, even though it is caused by a later instruction! 

Since we are implementing precise exceptions, the pipeline is required to 
handle the exception caused by the LD instruction first. To explain how this 
works, let's call the instruction in the position of the LD instruction i, and the 
instruction in the position of the DADD instruction i + 1. The pipeline cannot
simply handle an exception when it occurs in time, since that will lead to exceptions
occurring out of the unpipelined order. Instead, the hardware posts all exceptions 
caused by a given instruction in a status vector associated with that instruction. 
The exception status vector is carried along as the instruction goes down the 
pipeline. Once an exception indication is set in the exception status vector, any 
control signal that may cause a data value to be written is turned off (this includes 
both register writes and memory writes). Because a store can cause an exception 
during MEM, the hardware must be prepared to prevent the store from complet- 
ing if it raises an exception. 
When an instruction enters WB (or is about to leave MEM), the exception status
vector is checked. If any exceptions are posted, they are handled in the order in
which they would occur in time on an unpipelined processor—the exception corre- 
sponding to the earliest instruction (and usually the earliest pipe stage for that 
instruction) is handled first. This guarantees that all exceptions will be seen on 
instruction i before any are seen on i + 1. Of course, any action taken in earlier pipe
stages on behalf of instruction i may be invalid, but since writes to the register file 
and memory were disabled, no state could have been changed. As we will see in 
Section A.5, maintaining this precise model for FP operations is much harder. 
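
The exception status vector can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the `Instr` class, the `run` function, and the way faults are supplied as a stage-to-exception dictionary are all assumptions made for the example. It assumes an ideal five-stage pipeline with no stalls, posts each exception into the vector of the instruction that raised it, and handles nothing until an instruction enters WB.

```python
from dataclasses import dataclass, field

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

@dataclass
class Instr:
    name: str
    faults: dict                                  # stage name -> exception raised there (example input)
    status: list = field(default_factory=list)    # exception status vector

def run(program):
    """Simulate an ideal five-stage pipeline, one instruction entering per cycle.

    Returns (raised, handled): `raised` lists exceptions in the order they are
    posted in time; `handled` is the (instruction, exception) pair taken first.
    """
    raised = []
    for cycle in range(len(program) + len(STAGES) - 1):
        for i, ins in enumerate(program):          # oldest instruction first
            s = cycle - i
            if not 0 <= s < len(STAGES):
                continue
            stage = STAGES[s]
            if stage in ins.faults:                # post the exception, do not handle it yet
                ins.status.append(ins.faults[stage])
                raised.append((ins.name, ins.faults[stage]))
            if stage == "WB" and ins.status:       # check the vector on entering WB
                return raised, (ins.name, ins.status[0])
    return raised, None

program = [
    Instr("LD", {"MEM": "data page fault"}),
    Instr("DADD", {"IF": "instruction page fault"}),
]
raised, handled = run(program)
print(raised)    # the DADD fault is posted first in time
print(handled)   # the LD fault is the one handled first
```

The model omits the step that turns off register and memory writes once a vector entry is set; in real hardware that step is what keeps the state unchanged until the exception is taken.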

In the next subsection we describe problems that arise in implementing 
exceptions in the pipelines of processors with more powerful, longer-running 
instructions. 


Instruction Set Complications 


No MIPS instruction has more than one result, and our MIPS pipeline writes that 
result only at the end of an instruction’s execution. When an instruction is guar- 
anteed to complete, it is called committed. In the MIPS integer pipeline, all 
instructions are committed when they reach the end of the MEM stage (or begin- 
ning of WB) and no instruction updates the state before that stage. Thus, precise 
exceptions are straightforward. Some processors have instructions that change 
the state in the middle of the instruction execution, before the instruction and its 
predecessors are guaranteed to complete. For example, autoincrement addressing 
modes in the IA-32 architecture cause the update of registers in the middle of an 
instruction execution. In such a case, if the instruction is aborted because of an 
exception, it will leave the processor state altered. Although we know which 
instruction caused the exception, without additional hardware support the excep- 
tion will be imprecise because the instruction will be half finished. Restarting the 
instruction stream after such an imprecise exception is difficult. Alternatively, we 
could avoid updating the state before the instruction commits, but this may be 
difficult or costly, since there may be dependences on the updated state: Consider 
a VAX instruction that autoincrements the same register multiple times. Thus, to 
maintain a precise exception model, most processors with such instructions have 
the ability to back out any state changes made before the instruction is commit- 
ted. If an exception occurs, the processor uses this ability to reset the state of the 
processor to its value before the interrupted instruction started. In the next sec- 
tion, we will see that a more powerful MIPS floating-point pipeline can introduce 
similar problems, and Section A.7 introduces techniques that substantially com- 
plicate exception handling. 

A related source of difficulties arises from instructions that update memory 
state during execution, such as the string copy operations on the VAX or IBM 360 
(see Appendix J). To make it possible to interrupt and restart these instructions, 
the instructions are defined to use the general-purpose registers as working regis- 
ters. Thus the state of the partially completed instruction is always in the regis- 
ters, which are saved on an exception and restored after the exception, allowing 
the instruction to continue. In the VAX an additional bit of state records when an
instruction has started updating the memory state, so that when the pipeline is 
restarted, the CPU knows whether to restart the instruction from the beginning or 
from the middle of the instruction. The IA-32 string instructions also use the reg- 
isters as working storage, so that saving and restoring the registers saves and 
restores the state of such instructions. 

A different set of difficulties arises from odd bits of state that may create 
additional pipeline hazards or may require extra hardware to save and restore. 
Condition codes are a good example of this. Many processors set the condition 
codes implicitly as part of the instruction. This approach has advantages, since 
condition codes decouple the evaluation of the condition from the actual branch. 
However, implicitly set condition codes can cause difficulties in scheduling any 
pipeline delays between setting the condition code and the branch, since most 
instructions set the condition code and cannot be used in the delay slots between 
the condition evaluation and the branch. 

Additionally, in processors with condition codes, the processor must decide 
when the branch condition is fixed. This involves finding out when the condition 
code has been set for the last time before the branch. In most processors with 
implicitly set condition codes, this is done by delaying the branch condition eval- 
uation until all previous instructions have had a chance to set the condition code. 

Of course, architectures with explicitly set condition codes allow the delay 
between condition test and the branch to be scheduled; however, pipeline control 
must still track the last instruction that sets the condition code to know when the 
branch condition is decided. In effect, the condition code must be treated as an 
operand that requires hazard detection for RAW hazards with branches, just as 
MIPS must do on the registers. 

A final thorny area in pipelining is multicycle operations. Imagine trying to 
pipeline a sequence of VAX instructions such as this: 


MOVL   R1,R2                   ;moves between registers
ADDL3  42(R1),56(R1)+,@(R1)    ;adds memory locations
SUBL2  R2,R3                   ;subtracts registers
MOVC3  @(R1)[R2],74(R2),R3     ;moves a character string


These instructions differ radically in the number of clock cycles they will require, 
from as low as one up to hundreds of clock cycles. They also require different 
numbers of data memory accesses, from zero to possibly hundreds. The data haz- 
ards are very complex and occur both between and within instructions. The sim- 
ple solution of making all instructions execute for the same number of clock 
cycles is unacceptable because it introduces an enormous number of hazards and 
bypass conditions and makes an immensely long pipeline. Pipelining the VAX at 
the instruction level is difficult, but a clever solution was found by the VAX 8800 
designers. They pipeline the microinstruction execution: a microinstruction is a 
simple instruction used in sequences to implement a more complex instruction 
set. Because the microinstructions are simple (they look a lot like MIPS), the 
pipeline control is much easier. Since 1995, all Intel IA-32 microprocessors have 
used this strategy of converting the IA-32 instructions into microoperations, and
then pipelining the microoperations. 

In comparison, load-store processors have simple operations with similar 
amounts of work and pipeline more easily. If architects realize the relationship 
between instruction set design and pipelining, they can design architectures for 
more efficient pipelining. In the next section we will see how the MIPS pipeline 
deals with long-running instructions, specifically floating-point operations. 

For many years the interaction between instruction sets and implementations 
was believed to be small, and implementation issues were not a major focus in 
designing instruction sets. In the 1980s it became clear that the difficulty and 
inefficiency of pipelining could both be increased by instruction set complications.
In the 1990s, all companies moved to simpler instruction sets with the
goal of reducing the complexity of aggressive implementations. 


Extending the MIPS Pipeline to Handle Multicycle 
Operations 


We now want to explore how our MIPS pipeline can be extended to handle 
floating-point operations. This section concentrates on the basic approach and the 
design alternatives, closing with some performance measurements of a MIPS 
floating-point pipeline. 

It is impractical to require that all MIPS floating-point operations complete in 
1 clock cycle, or even in 2. Doing so would mean accepting a slow clock, or 
using enormous amounts of logic in the floating-point units, or both. Instead, the 
floating-point pipeline will allow for a longer latency for operations. This is eas- 
ier to grasp if we imagine the floating-point instructions as having the same pipe- 
line as the integer instructions, with two important changes. First, the EX cycle 
may be repeated as many times as needed to complete the operation—the number 
of repetitions can vary for different operations. Second, there may be multiple 
floating-point functional units. A stall will occur if the instruction to be issued 
will either cause a structural hazard for the functional unit it uses or cause a data 
hazard. 

For this section, let's assume that there are four separate functional units in 
our MIPS implementation: 


1. The main integer unit that handles loads and stores, integer ALU operations, 
and branches 

2. FP and integer multiplier 

3. FP adder that handles FP add, subtract, and conversion 

4. FP and integer divider 


If we also assume that the execution stages of these functional units are not pipe- 
lined, then Figure A.29 shows the resulting pipeline structure. Because EX is not 
pipelined, no other instruction using that functional unit may issue until the previous
instruction leaves EX. Moreover, if an instruction cannot proceed to the EX
stage, the entire pipeline behind that instruction will be stalled. 

In reality, the intermediate results are probably not cycled around the EX unit 
as Figure A.29 suggests; instead, the EX pipeline stage has some number of 
clock delays larger than 1. We can generalize the structure of the FP pipeline 
shown in Figure A.29 to allow pipelining of some stages and multiple ongoing 
operations. To describe such a pipeline, we must define both the latency of the 
functional units and also the initiation interval or repeat interval. We define 
latency the same way we defined it earlier: the number of intervening cycles 
between an instruction that produces a result and an instruction that uses the 
result. The initiation or repeat interval is the number of cycles that must elapse 
between issuing two operations of a given type. For example, we will use the 
latencies and initiation intervals shown in Figure A.30. 

With this definition of latency, integer ALU operations have a latency of 0, 
since the results can be used on the next clock cycle, and loads have a latency of 
1, since their results can be used after one intervening cycle. Since most opera- 
tions consume their operands at the beginning of EX, the latency is usually the 
number of stages after EX that an instruction produces a result—for example, 
zero stages for ALU operations and one stage for loads. The primary exception is 
stores, which consume the value being stored 1 cycle later. Hence the latency to a 
store for the value being stored, but not for the base address register, will be 







Figure A.29 The MIPS pipeline with three additional unpipelined, floating-point, 
functional units. Because only one instruction issues on every clock cycle, all instruc- 
tions go through the standard pipeline for integer operations. The floating-point opera- 
tions simply loop when they reach the EX stage. After they have finished the EX stage, 
they proceed to MEM and WB to complete execution. 

Functional unit                        Latency   Initiation interval
Integer ALU                            0         1
Data memory (integer and FP loads)     1         1
FP add                                 3         1
FP multiply (also integer multiply)    6         1
FP divide (also integer divide)        24        25





Figure A.30 Latencies and initiation intervals for functional units. 


1 cycle less. Pipeline latency is essentially equal to 1 cycle less than the depth of 
the execution pipeline, which is the number of stages from the EX stage to the 
stage that produces the result. Thus, for the example pipeline just above, the num- 
ber of stages in an FP add is four, while the number of stages in an FP multiply is 
seven. To achieve a higher clock rate, designers need to put fewer logic levels in 
each pipe stage, which makes the number of pipe stages required for more com- 
plex operations larger. The penalty for the faster clock rate is thus longer latency 
for operations. 

The example pipeline structure in Figure A.30 allows up to four outstanding 
FP adds, seven outstanding FP/integer multiplies, and one FP divide. Figure A.31 
shows how this pipeline can be drawn by extending Figure A.29. The repeat 
interval is implemented in Figure A.31 by adding additional pipeline stages, 
which will be separated by additional pipeline registers. Because the units are 
independent, we name the stages differently. The pipeline stages that take multi- 
ple clock cycles, such as the divide unit, are further subdivided to show the 
latency of those stages. Because they are not complete stages, only one operation 
may be active. The pipeline structure can also be shown using the familiar dia- 
grams from earlier in the appendix, as Figure A.32 shows for a set of independent 
FP operations and FP loads and stores. Naturally, the longer latency of the FP 
operations increases the frequency of RAW hazards and resultant stalls, as we will 
see later in this section. 
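
As a quick check of these numbers, the following sketch derives the execution depth and the number of operations that can be outstanding from the values in Figure A.30. The function names and the ceiling-division formula for outstanding operations are assumptions for illustration; they simply reproduce the counts stated in the text (four adds, seven multiplies, one divide).

```python
# (latency, initiation interval) per functional unit, copied from Figure A.30.
UNITS = {
    "Integer ALU": (0, 1),
    "Data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}

def depth(unit):
    """Stages from EX through the stage that produces the result (latency + 1)."""
    latency, _ = UNITS[unit]
    return latency + 1

def max_outstanding(unit):
    """Operations of one type that can be in flight at once (assumed formula)."""
    _, interval = UNITS[unit]
    return -(-depth(unit) // interval)   # ceiling division

for unit in ("FP add", "FP multiply", "FP divide"):
    print(unit, "depth", depth(unit), "outstanding", max_outstanding(unit))
```

For the divider, the initiation interval equals the computed depth, so only one divide can be active, which is what makes it a source of structural hazards.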

The structure of the pipeline in Figure A.31 requires the introduction of the 
additional pipeline registers (e.g., A1/A2, A2/A3, A3/A4) and the modification of 
the connections to those registers. The ID/EX register must be expanded to connect
ID to EX, DIV, M1, and A1; we can refer to the portion of the register associated
with one of the next stages with the notation ID/EX, ID/DIV, ID/M1, or
ID/A1. The pipeline register between ID and all the other stages may be thought 
of as logically separate registers and may, in fact, be implemented as separate 
registers. Because only one operation can be in a pipe stage at a time, the control 
information can be associated with the register at the head of the stage. 


Hazards and Forwarding in Longer Latency Pipelines 


There are a number of different aspects to the hazard detection and forwarding 
for a pipeline like that in Figure A.31. 
Figure A.31 A pipeline that supports multiple outstanding FP operations. The FP multiplier and adder are fully
pipelined and have a depth of seven and four stages, respectively. The FP divider is not pipelined, but requires 24
clock cycles to complete. The latency in instructions between the issue of an FP operation and the use of the result of
that operation without incurring a RAW stall is determined by the number of cycles spent in the execution stages. For 
example, the fourth instruction after an FP add can use the result of the FP add. For integer ALU operations, the 
depth of the execution pipeline is always one and the next instruction can use the results. 














MUL.D   IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
ADD.D        IF   ID   A1   A2   A3   A4   MEM  WB
L.D               IF   ID   EX   MEM  WB
S.D                    IF   ID   EX   MEM  WB





Figure A.32 The pipeline timing of a set of independent FP operations. The stages in italics show where data are
needed, while the stages in bold show where a result is available. The ".D" extension on the instruction mnemonic
indicates double-precision (64-bit) floating-point operations. FP loads and stores use a 64-bit path to memory so that
the pipelining timing is just like an integer load or store.


1. Because the divide unit is not fully pipelined, structural hazards can occur. 
These will need to be detected and issuing instructions will need to be stalled. 


2. Because the instructions have varying running times, the number of register 
writes required in a cycle can be larger than 1. 


3. WAW hazards are possible, since instructions no longer reach WB in order. 
Note that WAR hazards are not possible, since the register reads always occur 
in ID. 
4. Instructions can complete in a different order than they were issued, causing
problems with exceptions; we deal with this in the next subsection. 


5. Because of longer latency of operations, stalls for RAW hazards will be more 
frequent. 


The increase in stalls arising from longer operation latencies is fundamentally the 
same as that for the integer pipeline. Before describing the new problems that 
arise in this FP pipeline and looking at solutions, let's examine the potential 
impact of RAW hazards. Figure A.33 shows a typical FP code sequence and the 
resultant stalls. At the end of this section, we'll examine the performance of this 
FP pipeline for our SPEC subset. 
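
The data stalls in a chain of dependent instructions such as the one in Figure A.33 follow directly from the latencies in Figure A.30. The sketch below is a simplified model, assuming full forwarding and a consumer that issues immediately after its producer; the `data_stalls` helper is hypothetical. It counts only RAW stalls, so any extra cycle caused by a structural conflict is not included.

```python
# Producer latencies taken from Figure A.30.
LATENCY = {"L.D": 1, "MUL.D": 6, "ADD.D": 3}

def data_stalls(producer, consumer_is_store=False):
    """RAW stall cycles for a consumer that directly follows its producer."""
    latency = LATENCY[producer]
    if consumer_is_store:
        latency -= 1          # the value being stored is consumed one cycle later
    return max(0, latency)

chain = [("L.D", "MUL.D"), ("MUL.D", "ADD.D"), ("ADD.D", "S.D")]
for producer, consumer in chain:
    print(producer, "->", consumer, data_stalls(producer, consumer == "S.D"))
```

In Figure A.33 the S.D shows three stall cycles rather than the two computed here; the third is the structural stall that keeps its MEM stage from colliding with that of the ADD.D.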

Now look at the problems arising from writes, described as (2) and (3) in the 
earlier list. If we assume the FP register file has one write port, sequences of FP 
operations, as well as an FP load together with FP operations, can cause conflicts 
for the register write port. Consider the pipeline sequence shown in Figure A.34. In 


                 Clock cycle number
Instruction      1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
L.D   F4,0(R2)   IF    ID    EX    MEM   WB
MUL.D F0,F4,F6         IF    ID    stall M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADD.D F2,F0,F8               IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    MEM   WB
S.D   F2,0(R2)                           IF    stall stall stall stall stall stall ID    EX    stall stall stall MEM














Figure A.33 A typical FP code sequence showing the stalls arising from RAW hazards. The longer pipeline
substantially raises the frequency of stalls versus the shallower integer pipeline. Each instruction in this sequence is
dependent on the previous and proceeds as soon as data are available, which assumes the pipeline has full bypassing
and forwarding. The S.D must be stalled an extra cycle so that its MEM does not conflict with the ADD.D. Extra
hardware could easily handle this case.





                 Clock cycle number
Instruction      1    2    3    4    5    6    7    8    9    10   11
MUL.D F0,F4,F6   IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                   IF   ID   EX   MEM  WB
...                        IF   ID   EX   MEM  WB
ADD.D F2,F4,F6                  IF   ID   A1   A2   A3   A4   MEM  WB
...                                  IF   ID   EX   MEM  WB
...                                       IF   ID   EX   MEM  WB
L.D F2,0(R2)                                   IF   ID   EX   MEM  WB





Figure A.34 Three instructions want to perform a write back to the FP register file simultaneously, as shown in
clock cycle 11. This is not the worst case, since an earlier divide in the FP unit could also finish on the same clock.
Note that although the MUL.D, ADD.D, and L.D all are in the MEM stage in clock cycle 10, only the L.D actually uses the
memory, so no structural hazard exists for MEM.
clock cycle 11, all three instructions will reach WB and want to write the register
file. With only a single register file write port, the processor must serialize the 
instruction completion. This single register port represents a structural hazard. We 
could increase the number of write ports to solve this, but that solution may be 
unattractive since the additional write ports would be used only rarely. This is 
because the maximum steady-state number of write ports needed is 1. Instead, we 
choose to detect and enforce access to the write port as a structural hazard. 

There are two different ways to implement this interlock. The first is to track 
the use of the write port in the ID stage and to stall an instruction before it issues, 
just as we would for any other structural hazard. Tracking the use of the write 
port can be done with a shift register that indicates when already-issued instruc- 
tions will use the register file. If the instruction in ID needs to use the register file 
at the same time as an instruction already issued, the instruction in ID is stalled 
for a cycle. On each clock the reservation register is shifted 1 bit. This implemen- 
tation has an advantage: It maintains the property that all interlock detection and 
stall insertion occurs in the ID stage. The cost is the addition of the shift register 
and write conflict logic. We will assume this scheme throughout this section. 
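
A minimal sketch of the shift-register scheme follows. The class and method names are hypothetical, and the register is modeled as a Python integer whose bit k is set when the write port is already reserved k cycles in the future. The example replays the MUL.D and ADD.D from Figure A.34: the MUL.D is in ID in cycle 2 and writes in cycle 11 (9 cycles later), and the ADD.D is in ID in cycle 5 and would also write in cycle 11 (6 cycles later).

```python
class WritePortReservation:
    """Tracks future use of a single register-file write port (illustrative model)."""

    def __init__(self):
        self.bits = 0   # bit k set => write port is taken k cycles from now

    def try_issue(self, cycles_until_wb):
        """Reserve the port for an instruction in ID; return False if it must stall."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False            # conflict with an already-issued instruction
        self.bits |= mask
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits >>= 1

port = WritePortReservation()
print(port.try_issue(9))    # MUL.D issues in cycle 2, will write in cycle 11
for _ in range(3):          # advance to cycle 5
    port.tick()
print(port.try_issue(6))    # ADD.D would also write in cycle 11, so it stalls
port.tick()                 # cycle 6
print(port.try_issue(6))    # retry succeeds; the ADD.D now writes in cycle 12
```

Because the check happens while the instruction is still in ID, the stall logic stays in one place, which is the property the text singles out as the advantage of this scheme.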

An alternative scheme is to stall a conflicting instruction when it tries to enter 
either the MEM or WB stage. If we wait to stall the conflicting instructions until 
they want to enter the MEM or WB stage, we can choose to stall either instruc- 
tion. A simple, though sometimes suboptimal, heuristic is to give priority to the 
unit with the longest latency, since that is the one most likely to have caused 
another instruction to be stalled for a RAW hazard. The advantage of this scheme 
is that it does not require us to detect the conflict until the entrance of the MEM 
or WB stage, where it is easy to see. The disadvantage is that it complicates pipe- 
line control, as stalls can now arise from two places. Notice that stalling before 
entering MEM will cause the EX, A4, or M7 stage to be occupied, possibly forc- 
ing the stall to trickle back in the pipeline. Likewise, stalling before WB would 
cause MEM to back up. 

Our other problem is the possibility of WAW hazards. To see that these exist,
consider the example in Figure A.34. If the L.D instruction were issued one cycle
earlier and had a destination of F2, then it would create a WAW hazard, because it
would write F2 one cycle earlier than the ADD.D. Note that this hazard only
occurs when the result of the ADD.D is overwritten without any instruction ever
using it! If there were a use of F2 between the ADD.D and the L.D, the pipeline
would need to be stalled for a RAW hazard, and the L.D would not issue until the
ADD.D was completed. We could argue that, for our pipeline, WAW hazards only
occur when a useless instruction is executed, but we must still detect them and
make sure that the result of the L.D appears in F2 when we are done. (As we will
see in Section A.8, such sequences sometimes do occur in reasonable code.)

There are two possible ways to handle this WAW hazard. The first approach is
to delay the issue of the load instruction until the ADD.D enters MEM. The second
approach is to stamp out the result of the ADD.D by detecting the hazard and
changing the control so that the ADD.D does not write its result. Then the L.D can
issue right away. Because this hazard is rare, either scheme will work fine; you
can pick whatever is simpler to implement. In either case, the hazard can be
detected during ID when the L.D is issuing. Then stalling the L.D or making the
ADD.D a no-op is easy. The difficult situation is to detect that the L.D might finish
before the ADD.D, because that requires knowing the length of the pipeline and
the current position of the ADD.D. Luckily, this code sequence (two writes with no
intervening read) will be very rare, so we can use a simple solution: If an instruction
in ID wants to write the same register as an instruction already issued, do not
issue the instruction to EX. In Section A.7, we will see how additional hardware
can eliminate stalls for such hazards. First, let's put together the pieces for
implementing the hazard and issue logic in our FP pipeline.

In detecting the possible hazards, we must consider hazards among FP 
instructions, as well as hazards between an FP instruction and an integer instruc- 
tion. Except for FP loads-stores and FP-integer register moves, the FP and integer 
registers are distinct. All integer instructions operate on the integer registers, 
while the floating-point operations operate only on their own registers. Thus, we 
need only consider FP loads-stores and FP register moves in detecting hazards 
between FP and integer instructions. This simplification of pipeline control is an 
additional advantage of having separate register files for integer and floating- 
point data. (The main advantages are a doubling of the number of registers, with- 
out making either set larger, and an increase in bandwidth without adding more 
ports to either set. The main disadvantage, beyond the need for an extra register 
file, is the small cost of occasional moves needed between the two register sets.) 
Assuming that the pipeline does all hazard detection in ID, there are three checks 
that must be performed before an instruction can issue: 


1. Check for structural hazards—Wait until the required functional unit is not 
busy (this is only needed for divides in this pipeline) and make sure the regis- 
ter write port is available when it will be needed. 


2. Check for a RAW data hazard—Wait until the source registers are not listed as 
pending destinations in a pipeline register that will not be available when this 
instruction needs the result. A number of checks must be made here, depending 
on both the source instruction, which determines when the result will be avail- 
able, and the destination instruction, which determines when the value is 
needed. For example, if the instruction in ID is an FP operation with source reg- 
ister F2, then F2 cannot be listed as a destination in ID/A1, A1/A2, or A2/A3, 
which correspond to FP add instructions that will not be finished when the 
instruction in ID needs a result. (ID/A1 is the portion of the output register of 
ID that is sent to A1.) Divide is somewhat more tricky, if we want to allow the
last few cycles of a divide to be overlapped, since we need to handle the case 
when a divide is close to finishing as special. In practice, designers might 
ignore this optimization in favor of a simpler issue test. 


3. Check for a WAW data hazard—Determine if any instruction in A1, . . . , A4,
D, M1, . . . , M7 has the same register destination as this instruction. If so,
stall the issue of the instruction in ID. 
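
The three checks can be sketched as a single issue test. Everything about the representation here is an assumption made for the example: the `Instr` record, the `pending` dictionary (destination register of each in-flight instruction mapped to the number of cycles until its result can be forwarded), and the Boolean inputs for the divider and the write port. Real pipeline control would read this information from the pipeline registers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    unit: str                 # "int", "add", "mul", or "div"
    dest: Optional[str]       # destination register, or None for a store or branch
    srcs: Tuple[str, ...]     # source registers

def can_issue(instr, divider_busy, write_port_free, pending):
    """Return (ok, reason) for the instruction currently in ID."""
    # 1. Structural hazards: only the divider is unpipelined; also check the write port.
    if instr.unit == "div" and divider_busy:
        return False, "structural: divider busy"
    if instr.dest is not None and not write_port_free:
        return False, "structural: write port"
    # 2. RAW hazard: a source is still being produced and cannot be forwarded in time.
    for src in instr.srcs:
        if pending.get(src, 0) > 0:
            return False, "RAW on " + src
    # 3. WAW hazard: an in-flight instruction has the same destination.
    if instr.dest in pending:
        return False, "WAW on " + instr.dest
    return True, "issue"

pending = {"F2": 2, "F0": 0}   # F2 not ready for 2 cycles; F0 can be forwarded now
print(can_issue(Instr("add", "F6", ("F2", "F4")), False, True, pending))
print(can_issue(Instr("int", "F2", ("R2",)), False, True, pending))
print(can_issue(Instr("mul", "F8", ("F0", "F4")), False, True, pending))
```

The WAW test is deliberately conservative, matching the simple solution described earlier: any in-flight instruction with the same destination blocks issue, whether or not it would actually finish later.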
Although the hazard detection is more complex with the multicycle FP operations,
the concepts are the same as for the MIPS integer pipeline. The same is true
for the forwarding logic. The forwarding can be implemented by checking if the 
destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or 
MEM/WB registers is one of the source registers of a floating-point instruction. 
If so, the appropriate input multiplexer will have to be enabled so as to choose the 
forwarded data. In the exercises, you will have the opportunity to specify the 
logic for the RAW and WAW hazard detection as well as for forwarding. 

Multicycle FP operations also introduce problems for our exception mecha- 
nisms, which we deal with next. 


Maintaining Precise Exceptions 


Another problem caused by these long-running instructions can be illustrated 
with the following sequence of code: 


DIV.D  F0,F2,F4
ADD.D  F10,F10,F8
SUB.D  F12,F12,F14


This code sequence looks straightforward; there are no dependences. A problem
arises, however, because an instruction issued early may complete after an
instruction issued later. In this example, we can expect ADD.D and SUB.D to
complete before the DIV.D completes. This is called out-of-order completion and is
common in pipelines with long-running operations (see Section A.7). Because
hazard detection will prevent any dependence among instructions from being
violated, why is out-of-order completion a problem? Suppose that the SUB.D
causes a floating-point arithmetic exception at a point where the ADD.D has
completed but the DIV.D has not. The result will be an imprecise exception,
something we are trying to avoid. It may appear that this could be handled by letting
the floating-point pipeline drain, as we do for the integer pipeline. But the
exception may be in a position where this is not possible. For example, if the DIV.D
decided to take a floating-point-arithmetic exception after the add completed, we
could not have a precise exception at the hardware level. In fact, because the
ADD.D destroys one of its operands, we could not restore the state to what it was
before the DIV.D, even with software help.
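
To see the out-of-order completion concretely, the sketch below computes the clock cycle in which each of the three instructions reaches WB. It is a simplified model under stated assumptions: one instruction is fetched per cycle, there are no stalls or write-port conflicts, and the execution cycle counts (4 for FP add and subtract, 24 for divide) are taken from the Figure A.31 caption. The helper name `wb_cycle` is hypothetical.

```python
# Cycles spent in the execution stages, per the Figure A.31 caption.
EX_CYCLES = {"DIV.D": 24, "ADD.D": 4, "SUB.D": 4}

def wb_cycle(position, op):
    """Clock cycle of WB for the instruction at `position` (0 = fetched in cycle 1)."""
    fetch = position + 1
    # IF, then ID, then EX_CYCLES[op] execution cycles, then MEM, then WB.
    return fetch + 1 + EX_CYCLES[op] + 2

program = ["DIV.D", "ADD.D", "SUB.D"]
for position, op in enumerate(program):
    print(op, "reaches WB in cycle", wb_cycle(position, op))

completion_order = sorted(program, key=lambda op: wb_cycle(program.index(op), op))
print(completion_order)
```

Under these assumptions the ADD.D and SUB.D write their results long before the DIV.D does, so a late exception in the DIV.D finds F10 already overwritten.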

This problem arises because instructions are completing in a different order 
than they were issued. There are four possible approaches to dealing with out-of- 
order completion. The first is to ignore the problem and settle for imprecise excep- 
tions. This approach was used in the 1960s and early 1970s. It is still used in some 
supercomputers, where certain classes of exceptions are not allowed or are handled 
by the hardware without stopping the pipeline. It is difficult to use this approach in 
most processors built today because of features such as virtual memory and the 
IEEE floating-point standard, which essentially require precise exceptions through 
a combination of hardware and software. As mentioned earlier, some recent processors
have solved this problem by introducing two modes of execution: a fast, but
possibly imprecise mode and a slower, precise mode. The slower precise mode is
implemented either with a mode switch or by insertion of explicit instructions that
test for FP exceptions. In either case the amount of overlap and reordering permitted
in the FP pipeline is significantly restricted so that effectively only one FP
instruction is active at a time. This solution is used in the DEC Alpha 21064 and
21164, in the IBM Power1 and Power2, and in the MIPS R8000.

A second approach is to buffer the results of an operation until all the opera- 
tions that were issued earlier are complete. Some CPUs actually use this solution, 
but it becomes expensive when the difference in running times among operations 
is large, since the number of results to buffer can become large. Furthermore, 
results from the queue must be bypassed to continue issuing instructions while 
waiting for the longer instruction. This requires a large number of comparators 
and a very large multiplexer. 

There are two viable variations on this basic approach. The first is a history
file, used in the CYBER 180/990. The history file keeps track of the original
values of registers. When an exception occurs and the state must be rolled back 
earlier than some instruction that completed out of order, the original value of 
the register can be restored from the history file. A similar technique is used for 
autoincrement and autodecrement addressing on processors like VAXes. 
Another approach, the future file, proposed by Smith and Pleszkun [1988], 
keeps the newer value of a register; when all earlier instructions have com- 
pleted, the main register file is updated from the future file. On an exception, 
the main register file has the precise values for the interrupted state. In Chapter 
2, we will see extensions of this idea, which are used in processors such as the 
PowerPC 620 and the MIPS R10000 to allow overlap and reordering while pre- 
serving precise exceptions. 
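
The difference between the two variations can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class names, the sequence numbers, and the register values are hypothetical, and each register is assumed to be written by at most one in-flight instruction.

```python
class HistoryFile:
    """Registers are updated as soon as results arrive; old values are logged."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []  # entries: (sequence number, register, previous value)

    def write(self, seq, reg, value):
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq):
        # Undo, newest first, every write made by an instruction that
        # follows the faulting one in program order.
        for seq, reg, old in reversed(self.log):
            if seq > faulting_seq:
                self.regs[reg] = old
        self.log = [entry for entry in self.log if entry[0] <= faulting_seq]


class FutureFile:
    """Results go to a future file; the main file is updated in program order."""

    def __init__(self, regs):
        self.main = dict(regs)    # precise state, visible on an exception
        self.future = dict(regs)  # newest values
        self.pending = {}         # seq -> (register, value), not yet in main
        self.next_commit = 0

    def complete(self, seq, reg, value):
        self.future[reg] = value
        self.pending[seq] = (reg, value)
        # Copy results into the main file only once all earlier
        # instructions have completed.
        while self.next_commit in self.pending:
            r, v = self.pending.pop(self.next_commit)
            self.main[r] = v
            self.next_commit += 1

    def exception(self):
        # Discard speculative values; the main file is already precise.
        self.future = dict(self.main)
        self.pending.clear()
        return dict(self.main)
```

In both models an instruction with sequence number 1 may finish before a long-running instruction 0. The history file repairs the register state after the fact, while the future file never lets the main register file run ahead.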

A third technique in use is to allow the exceptions to become somewhat 
imprecise, but to keep enough information so that the trap-handling routines can 
create a precise sequence for the exception. This means knowing what operations 
were in the pipeline and their PCs. Then, after handling the exception, the soft- 
ware finishes any instructions that precede the latest instruction completed, and 
the sequence can restart. Consider the following worst-case code sequence: 


Instruction_1—A long-running instruction that eventually interrupts execution.


Instruction_2, ..., Instruction_{n-1}—A series of instructions that are not
completed.


Instruction_n—An instruction that is finished.


Given the PCs of all the instructions in the pipeline and the exception return PC,
the software can find the state of instruction_1 and instruction_n. Because
instruction_n has completed, we will want to restart execution at instruction_{n+1}.
After handling the exception, the software must simulate the execution of
instruction_1, ..., instruction_{n-1}. Then we can return from the exception and
restart at instruction_{n+1}. The complexity of executing these instructions
properly by the handler is the major difficulty of this scheme.


There is an important simplification for simple MIPS-like pipelines: If
instruction_2, ..., instruction_n are all integer instructions, then we know that if
instruction_n has completed, all of instruction_2, ..., instruction_{n-1} have also
completed. Thus, only floating-point operations need to be handled. To make this
scheme tractable, the number of floating-point instructions that can be over- 
lapped in execution can be limited. For example, if we only overlap two instruc- 
tions, then only the interrupting instruction need be completed by software. This 
restriction may reduce the potential throughput if the FP pipelines are deep or if 
there are a significant number of FP functional units. This approach is used in the 
SPARC architecture to allow overlap of floating-point and integer operations. 

The final technique is a hybrid scheme that allows the instruction issue to 
continue only if it is certain that all the instructions before the issuing instruction 
will complete without causing an exception. This guarantees that when an excep- 
tion occurs, no instructions after the interrupting one will be completed and all of 
the instructions before the interrupting one can be completed. This sometimes 
means stalling the CPU to maintain precise exceptions. To make this scheme 
work, the floating-point functional units must determine if an exception is possi- 
ble early in the EX stage (in the first 3 clock cycles in the MIPS pipeline), so as to 
prevent further instructions from completing. This scheme is used in the MIPS 
R2000/3000, the R4000, and the Intel Pentium. It is discussed further in 
Appendix I. 


Performance of a MIPS FP Pipeline


The MIPS FP pipeline of Figure A.31 on page A-50 can generate both structural 
stalls for the divide unit and stalls for RAW hazards (it also can have WAW haz- 
ards, but this rarely occurs in practice). Figure A.35 shows the number of stall 
cycles for each type of floating-point operation on a per-instance basis (i.e., the 
first bar for each FP benchmark shows the number of FP result stalls for each FP 
add, subtract, or convert). As we might expect, the stall cycles per operation track 
the latency of the FP operations, varying from 46% to 59% of the latency of the 
functional unit. 

Figure A.36 gives the complete breakdown of integer and floating-point stalls 
for five SPECfp benchmarks. There are four classes of stalls shown: FP result stalls, 
FP compare stalls, load and branch delays, and floating-point structural delays. The 
compiler tries to schedule both load and FP delays before it schedules branch 
delays. The total number of stalls per instruction varies from 0.65 to 1.21. 


A.6 Putting It All Together: The MIPS R4000 Pipeline


In this section we look at the pipeline structure and performance of the MIPS 
R4000 processor family, which includes the 4400. The R4000 implements 
MIPS64 but uses a deeper pipeline than that of our five-stage design both for 
integer and FP programs. This deeper pipeline allows it to achieve higher clock 
rates by decomposing the five-stage integer pipeline into eight stages. Because
cache access is particularly time critical, the extra pipeline stages come from
decomposing the memory access. This type of deeper pipelining is sometimes
called superpipelining.

Figure A.35 Stalls per FP operation for each major type of FP operation for the
SPEC89 FP benchmarks. Except for the divide structural hazards, these data do not
depend on the frequency of an operation, only on its latency and the number of cycles
before the result is used. The number of stalls from RAW hazards roughly tracks the
latency of the FP unit. For example, the average number of stalls per FP add, subtract,
or convert is 1.7 cycles, or 56% of the latency (3 cycles). Likewise, the average number
of stalls for multiplies and divides are 2.8 and 14.2, respectively, or 46% and 59% of the
corresponding latency. Structural hazards for divides are rare, since the divide
frequency is low.

Figure A.37 shows the eight-stage pipeline structure using an abstracted ver- 
sion of the data path. Figure A.38 shows the overlap of successive instructions in 
the pipeline. Notice that although the instruction and data memory occupy multi- 
ple cycles, they are fully pipelined, so that a new instruction can start on every 
clock. In fact, the pipeline uses the data before the cache hit detection is com- 
plete; Chapter 5 discusses how this can be done in more detail. 

The function of each stage is as follows: 


• IF—First half of instruction fetch; PC selection actually happens here,
together with initiation of instruction cache access.

• IS—Second half of instruction fetch, complete instruction cache access.

• RF—Instruction decode and register fetch, hazard checking, and also
instruction cache hit detection.

• EX—Execution, which includes effective address calculation, ALU operation,
and branch-target computation and condition evaluation.

• DF—Data fetch, first half of data cache access.

• DS—Second half of data fetch, completion of data cache access.

• TC—Tag check, determine whether the data cache access hit.

• WB—Write back for loads and register-register operations.


Figure A.36 The stalls occurring for the MIPS FP pipeline for five of the SPEC89 FP
benchmarks. The total number of stalls per instruction ranges from 0.65 for su2cor to
1.21 for doduc, with an average of 0.87. FP result stalls dominate in all cases, with an
average of 0.71 stalls per instruction, or 82% of the stalled cycles. Compares generate
an average of 0.1 stalls per instruction and are the second largest source. The divide
structural hazard is only significant for doduc.


Pipe stages: IF IS RF EX DF DS TC WB

Figure A.37 The eight-stage pipeline structure of the R4000 uses pipelined
instruction and data caches. The pipe stages are labeled and their detailed function is
described in the text. The vertical dashed lines represent the stage boundaries as well
as the location of pipeline latches. The instruction is actually available at the end of IS,
but the tag check is done in RF, while the registers are fetched. Thus, we show the
instruction memory as operating through RF. The TC stage is needed for data memory
access, since we cannot write the data into the register until we know whether the
cache access was a hit or not.


Figure A.38 The structure of the R4000 integer pipeline leads to a 2-cycle load delay. A 2-cycle delay is possible
because the data value is available at the end of DS and can be bypassed. If the tag check in TC indicates a miss, the
pipeline is backed up a cycle, when the correct data are available.


                       Clock number
Instruction number     1     2     3     4     5     6     7     8     9
LD   R1,...            IF    IS    RF    EX    DF    DS    TC    WB
DADD R2,R1,...               IF    IS    RF    stall stall EX    DF    DS
DSUB R3,R1,...                     IF    IS    stall stall RF    EX    DF
OR   R4,R1,...                           IF    stall stall IS    RF    EX

Figure A.39 A load instruction followed by an immediate use results in a 2-cycle stall. Normal forwarding paths
can be used after two cycles, so the DADD and DSUB get the value by forwarding after the stall. The OR instruction gets
the value from the register file. Since the two instructions after the load could be independent and hence not stall,
the bypass can be to instructions that are 3 or 4 cycles after the load.


In addition to substantially increasing the amount of forwarding required, this
longer-latency pipeline increases both the load and branch delays. Figure A.38
shows that load delays are 2 cycles, since the data value is available at the end of
DS. Figure A.39 shows the shorthand pipeline schedule when a use immediately
follows a load. It shows that forwarding is required for the result of a load
instruction to a destination that is 3 or 4 cycles later.

Figure A.40 The basic branch delay is 3 cycles, since the condition evaluation is performed during EX.


Figure A.40 shows that the basic branch delay is 3 cycles, since the branch 
condition is computed during EX. The MIPS architecture has a single-cycle 
delayed branch. The R4000 uses a predicted-not-taken strategy for the remaining 
2 cycles of the branch delay. As Figure A.41 shows, untaken branches are simply
1-cycle delayed branches, while taken branches have a 1-cycle delay slot
followed by 2 idle cycles. The instruction set provides a branch-likely instruction, 
which we described earlier and which helps in filling the branch delay slot. Pipe- 
line interlocks enforce both the 2-cycle branch stall penalty on a taken branch and 
any data hazard stall that arises from use of a load result. 
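
The load and branch penalties just described can be written down as two small Python functions. This is a sketch of the stall rules as stated in the text, not a pipeline simulator. The function and parameter names are assumptions made for illustration.

```python
def load_use_stall(distance):
    """Stall cycles when a load result is used `distance` instructions later.

    With a 2-cycle load delay, a use in the very next instruction
    (distance 1) stalls 2 cycles, a use two instructions later stalls
    1 cycle, and later uses do not stall.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, 3 - distance)


def branch_stall_cycles(taken, delay_slot_useful):
    """Cycles lost on one branch.

    A taken branch pays 2 idle cycles after its delay slot. Independently,
    a delay slot that is unfilled or canceled wastes 1 more cycle.
    """
    idle = 2 if taken else 0
    wasted_slot = 0 if delay_slot_useful else 1
    return idle + wasted_slot
```

For the sequence in Figure A.39, `load_use_stall(1)` gives the 2 stall cycles seen by the DADD. An untaken branch with a usefully filled delay slot costs nothing extra.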

In addition to the increase in stalls for loads and branches, the deeper pipeline 
increases the number of levels of forwarding for ALU operations. In our MIPS 
five-stage pipeline, forwarding between two register-register ALU instructions 
could happen from the ALU/MEM or the MEM/WB registers. In the R4000
pipeline, there are four possible sources for an ALU bypass: EX/DF, DF/DS, DS/TC,
and TC/WB.


The Floating-Point Pipeline 


The R4000 floating-point unit consists of three functional units: a floating-point 
divider, a floating-point multiplier, and a floating-point adder. The adder logic is 
used on the final step of a multiply or divide. Double-precision FP operations can 
take from 2 cycles (for a negate) up to 112 cycles for a square root. In addition, 
the various units have different initiation rates. The floating-point functional unit 
can be thought of as having eight different stages, listed in Figure A.42; these 
stages are combined in different orders to execute various FP operations. 
                       Clock number
Instruction number     1     2     3     4     5     6     7     8     9
Branch instruction     IF    IS    RF    EX    DF    DS    TC    WB
Delay slot                   IF    IS    RF    EX    DF    DS    TC    WB
Stall                              stall stall stall stall stall stall stall
Stall                                    stall stall stall stall stall stall
Branch target                                  IF    IS    RF    EX    DF

                       Clock number
Instruction number     1     2     3     4     5     6     7     8     9
Branch instruction     IF    IS    RF    EX    DF    DS    TC    WB
Delay slot                   IF    IS    RF    EX    DF    DS    TC    WB
Branch instruction + 2             IF    IS    RF    EX    DF    DS    TC
Branch instruction + 3                   IF    IS    RF    EX    DF    DS

Figure A.41 A taken branch, shown in the top portion of the figure, has a 1-cycle delay slot followed by a 2-cycle
stall, while an untaken branch, shown in the bottom portion, has simply a 1-cycle delay slot. The branch
instruction can be an ordinary delayed branch or a branch-likely, which cancels the effect of the instruction in the
delay slot if the branch is untaken.





























Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers





Figure A.42 The eight stages used in the R4000 floating-point pipelines. 


There is a single copy of each of these stages, and various instructions may
use a stage zero or more times and in different orders. Figure A.43 shows the
latency, initiation rate, and pipeline stages used by the most common
double-precision FP operations.


From the information in Figure A.43, we can determine whether a sequence
of different, independent FP operations can issue without stalling. If the timing of
the sequence is such that a conflict occurs for a shared pipeline stage, then a stall
will be needed. Figures A.44, A.45, A.46, and A.47 show four common possible
two-instruction sequences: a multiply followed by an add, an add followed by a
multiply, a divide followed by an add, and an add followed by a divide. The
figures show all the interesting starting positions for the second instruction and
whether that second instruction will issue or stall for each position. Of course,
there could be three instructions active, in which case the possibilities for stalls
are much higher and the figures more complex.


FP instruction   Latency  Initiation interval  Pipe stages
Add, subtract    4        3                    U, S + A, A + R, R + S
Multiply         8        4                    U, E + M, M, M, M, N, N + A, R
Divide           36       35                   U, A, R, D^28, D + A, D + R, D + A, D + R, A, R
Square root      112      111                  U, E, (A + R)^108, A, R
Negate           2        1                    U, S
Absolute value   2        1                    U, S
FP compare       3        2                    U, A, R

Figure A.43 The latencies and initiation intervals for the FP operations both depend on the FP unit stages that a
given operation must use. The latency values assume that the destination instruction is an FP operation; the
latencies are 1 cycle less when the destination is a store. The pipe stages are shown in the order in which they are
used for any operation. The notation S + A indicates a clock cycle in which both the S and A stages are used. The
notation D^28 indicates that the D stage is used 28 times in a row.


                       Clock cycle
Operation Issue/stall  0    1    2    3    4    5    6    7    8    9    10   11   12
Multiply  Issue        U    E+M  M    M    M    N    N+A  R
Add       Issue             U    S+A  A+R  R+S
          Issue                  U    S+A  A+R  R+S
          Issue                       U    S+A  A+R  R+S
          Stall                            U    S+A  A+R  R+S
          Stall                                 U    S+A  A+R  R+S
          Issue                                      U    S+A  A+R  R+S
          Issue                                           U    S+A  A+R  R+S

Figure A.44 An FP multiply issued at clock 0 is followed by a single FP add issued between clocks 1 and 7. The
second column indicates whether an instruction of the specified type stalls when it is issued n cycles later, where n is
the clock cycle number in which the U stage of the second instruction occurs. The stage or stages that cause a stall
are highlighted. Note that this table deals with only the interaction between the multiply and one add issued
between clocks 1 and 7. In this case, the add will stall if it is issued 4 or 5 cycles after the multiply; otherwise, it issues
without stalling. Notice that the add will be stalled for 2 cycles if it issues in cycle 4 since on the next clock cycle it will
still conflict with the multiply; if, however, the add issues in cycle 5, it will stall for only 1 clock cycle, since that will
eliminate the conflicts.
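
The conflict check behind these figures can be sketched in a few lines of Python. The stage lists are taken from the add and multiply rows of Figure A.43. The function name and the data layout are assumptions made for illustration, and the sketch only reports whether a start position conflicts, not how long the resulting stall lasts.

```python
# One set of stages per clock cycle, in order of use (from Figure A.43).
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]
MULTIPLY = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]


def conflicts(first, second, offset):
    """True if `second`, started `offset` cycles after `first`,
    needs some stage in the same cycle in which `first` uses it."""
    for i, stages in enumerate(second):
        cycle = offset + i
        if cycle < len(first) and first[cycle] & stages:
            return True
    return False


# A multiply at clock 0 followed by one add issued at clocks 1 to 7.
stalling = [n for n in range(1, 8) if conflicts(MULTIPLY, ADD, n)]
print(stalling)  # [4, 5]
```

The result agrees with the statement that the add stalls only when issued 4 or 5 cycles after the multiply. Swapping the arguments, as in `conflicts(ADD, MULTIPLY, n)`, reports no conflict for any later start.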
                       Clock cycle
Operation Issue/stall  0    1    2    3    4    5    6    7    8    9    10   11   12
Add       Issue        U    S+A  A+R  R+S
Multiply  Issue             U    E+M  M    M    M    N    N+A  R
          Issue                  U    E+M  M    M    M    N    N+A  R

Figure A.45 A multiply issuing after an add can always proceed without stalling, since the shorter instruction
clears the shared pipeline stages before the longer instruction reaches them.








                                Clock cycle
Operation Issue/stall           25   26   27   28   29   30   31   32   33   34   35   36
Divide    Issued in cycle 0...  D    D    D    D    D    D+A  D+R  D+A  D+R  A    R
Add       Issue                      U    S+A  A+R  R+S
          Issue                           U    S+A  A+R  R+S
          Stall                                U    S+A  A+R  R+S
          Stall                                     U    S+A  A+R  R+S
          Stall                                          U    S+A  A+R  R+S
          Stall                                               U    S+A  A+R  R+S
          Stall                                                    U    S+A  A+R  R+S
          Stall                                                         U    S+A  A+R  R+S
          Issue                                                              U    S+A  A+R
          Issue                                                                   U    S+A
          Issue                                                                        U

Figure A.46 An FP divide can cause a stall for an add that starts near the end of the divide. The divide starts at
cycle 0 and completes at cycle 35; the last 10 cycles of the divide are shown. Since the divide makes heavy use of the
rounding hardware needed by the add, it stalls an add that starts in any of cycles 28-33. Notice the add starting in
cycle 28 will be stalled until cycle 36. If the add started right after the divide, it would not conflict, since the add could
complete before the divide needed the shared stages, just as we saw in Figure A.45 for a multiply and add. As in the
earlier figure, this example assumes exactly one add that reaches the U stage between clock cycles 26 and 35.


                       Clock cycle
Operation Issue/stall  0    1    2    3    4    5    6    7    8    9    10   11   12
Add       Issue        U    S+A  A+R  R+S
Divide    Stall             U    A    R    D    D    D    D    D    D    D    D    D
          Issue                  U    A    R    D    D    D    D    D    D    D    D
          Issue                       U    A    R    D    D    D    D    D    D    D

Figure A.47 A double-precision add is followed by a double-precision divide. If the divide starts 1 cycle after the
add, the divide stalls, but after that there is no conflict.


Performance of the R4000 Pipeline


In this section we examine the stalls that occur for the SPEC92 benchmarks when
running on the R4000 pipeline structure. There are four major causes of pipeline
stalls or losses:


1. Load stalls—Delays arising from the use of a load result 1 or 2 cycles after
the load


2. Branch stalls—2-cycle stall on every taken branch plus unfilled or canceled
branch delay slots


3. FP result stalls—Stalls because of RAW hazards for an FP operand


4. FP structural stalls—Delays because of issue restrictions arising from
conflicts for functional units in the FP pipeline


Figure A.48 The pipeline CPI for 10 of the SPEC92 benchmarks, assuming a perfect
cache. The pipeline CPI varies from 1.2 to 2.8. The leftmost five programs are integer
programs, and branch delays are the major CPI contributor for these. The rightmost five
programs are FP, and FP result stalls are the major contributor for these. Figure A.49
shows the numbers used to construct this plot.


Benchmark        Pipeline CPI  Load stalls  Branch stalls  FP result stalls  FP structural stalls
compress         1.20          0.14         0.06           0.00              0.00
eqntott          1.88          0.27         0.61           0.00              0.00
espresso         1.42          0.07         0.35           0.00              0.00
gcc              1.56          0.13         0.43           0.00              0.00
li               1.64          0.18         0.46           0.00              0.00
Integer average  1.54          0.16         0.38           0.00              0.00
doduc            2.84          0.01         0.22           1.39              0.22
mdljdp2          2.66          0.01         0.31           1.20              0.15
ear              2.17          0.00         0.46           0.59              0.12
hydro2d          2.53          0.00         0.62           0.75              0.17
su2cor           2.18          0.02         0.07           0.84              0.26
FP average       2.48          0.01         0.33           0.95              0.18
Overall average  2.00          0.10         0.36           0.46              0.09

Figure A.49 The total pipeline CPI and the contributions of the four major sources of stalls are shown. The major
contributors are FP result stalls (both for branches and for FP inputs) and branch stalls, with loads and FP structural
stalls adding less.


Figure A.48 shows the pipeline CPI breakdown for the R4000 pipeline for the 10
SPEC92 benchmarks. Figure A.49 shows the same data but in tabular form.

From the data in Figures A.48 and A.49, we can see the penalty of the deeper
pipelining. The R4000's pipeline has much longer branch delays than the classic
five-stage pipeline. The longer branch delay substantially increases the cycles
spent on branches, especially for the integer programs with a higher branch fre- 
quency. An interesting effect for the FP programs is that the latency of the FP 
functional units leads to more result stalls than the structural hazards, which arise 
both from the initiation interval limitations and from conflicts for functional units 
from different FP instructions. Thus, reducing the latency of FP operations 
should be the first target, rather than more pipelining or replication of the func- 
tional units. Of course, reducing the latency would probably increase the struc- 
tural stalls, since many potential structural stalls are hidden behind data hazards. 
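
The breakdown in Figure A.49 can be checked with a short calculation: the pipeline CPI should be the ideal CPI of 1 plus the four stall contributions. The Python sketch below copies the per-benchmark rows from the figure. The variable and function names are illustrative assumptions, and the small tolerance allows for rounding in the published two-decimal values.

```python
# (pipeline CPI, load, branch, FP result, FP structural) from Figure A.49
ROWS = {
    "compress": (1.20, 0.14, 0.06, 0.00, 0.00),
    "eqntott":  (1.88, 0.27, 0.61, 0.00, 0.00),
    "espresso": (1.42, 0.07, 0.35, 0.00, 0.00),
    "gcc":      (1.56, 0.13, 0.43, 0.00, 0.00),
    "li":       (1.64, 0.18, 0.46, 0.00, 0.00),
    "doduc":    (2.84, 0.01, 0.22, 1.39, 0.22),
    "mdljdp2":  (2.66, 0.01, 0.31, 1.20, 0.15),
    "ear":      (2.17, 0.00, 0.46, 0.59, 0.12),
    "hydro2d":  (2.53, 0.00, 0.62, 0.75, 0.17),
    "su2cor":   (2.18, 0.02, 0.07, 0.84, 0.26),
}


def reconstructed_cpi(load, branch, fp_result, fp_structural):
    """Ideal CPI of 1 plus the stall cycles per instruction."""
    return 1.0 + load + branch + fp_result + fp_structural


def largest_gap(rows):
    """Largest difference between the reported and reconstructed CPI."""
    return max(abs(cpi - reconstructed_cpi(*stalls))
               for cpi, *stalls in rows.values())
```

For these rows the reported and reconstructed values differ by no more than about 0.01, which is consistent with rounding of the individual columns.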


A.7 Crosscutting Issues 


RISC Instruction Sets and Efficiency of Pipelining 


We have already discussed the advantages of instruction set simplicity in building 
pipelines. Simple instruction sets offer another advantage: They make it easier to 
schedule code to achieve efficiency of execution in a pipeline. To see this, consider 
a simple example: Suppose we need to add two values in memory and store the 
result back to memory. In some sophisticated instruction sets this will take only a 
single instruction; in others it will take two or three. A typical RISC architecture 
would require four instructions (two loads, an add, and a store). These instructions 
cannot be scheduled sequentially in most pipelines without intervening stalls. 
With a RISC instruction set, the individual operations are separate
instructions and may be individually scheduled either by the compiler (using the
techniques we discussed earlier and more powerful techniques discussed in Chapter
2) or using dynamic hardware scheduling techniques (which we discuss next and 
in further detail in Chapter 2). These efficiency advantages, coupled with the 
greater ease of implementation, appear to be so significant that almost all recent 
pipelined implementations of complex instruction sets actually translate their 
complex instructions into simple RISC-like operations, and then schedule and 
pipeline those operations. Chapter 2 shows that both the Pentium II and Pentium 
4 use this approach. 


Dynamically Scheduled Pipelines 


Simple pipelines fetch an instruction and issue it, unless there is a data depen- 
dence between an instruction already in the pipeline and the fetched instruction 
that cannot be hidden with bypassing or forwarding. Forwarding logic reduces 
the effective pipeline latency so that certain dependences do not result in hazards. 
If there is an unavoidable hazard, then the hazard detection hardware stalls the 
pipeline (starting with the instruction that uses the result). No new instructions 
are fetched or issued until the dependence is cleared. To overcome these perfor- 
mance losses, the compiler can attempt to schedule instructions to avoid the haz- 
ard; this approach is called compiler or static scheduling. 

Several early processors used another approach, called dynamic scheduling, 
whereby the hardware rearranges the instruction execution to reduce the stalls. 
This section offers a simpler introduction to dynamic scheduling by explaining 
the scoreboarding technique of the CDC 6600. Some readers will find it easier to 
read this material before plunging into the more complicated Tomasulo scheme, 
which is covered in Chapter 2. 

All the techniques discussed in this appendix so far use in-order instruction 
issue, which means that if an instruction is stalled in the pipeline, no later instruc- 
tions can proceed. With in-order issue, if two instructions have a hazard between 
them, the pipeline will stall, even if there are later instructions that are indepen- 
dent and would not stall. 

In the MIPS pipeline developed earlier, both structural and data hazards were 
checked during instruction decode (ID): When an instruction could execute prop- 
erly, it was issued from ID. To allow an instruction to begin execution as soon as 
its operands are available, even if a predecessor is stalled, we must separate the 
issue process into two parts: checking the structural hazards and waiting for the 
absence of a data hazard. We decode and issue instructions in order. However, we 
want the instructions to begin execution as soon as their data operands are avail- 
able. Thus, the pipeline will do out-of-order execution, which implies out-of- 
order completion. To implement out-of-order execution, we must split the ID 
pipe stage into two stages: 


1. Issue—Decode instructions, check for structural hazards.


2. Read operands—Wait until no data hazards, then read operands. 
The IF stage precedes the issue stage, and the EX stage follows the read operands
stage, just as in the MIPS pipeline. As in the MIPS floating-point pipeline,
execution may take multiple cycles, depending on the operation. Thus, we may 
need to distinguish when an instruction begins execution and when it completes 
execution; between the two times, the instruction is in execution. This allows 
multiple instructions to be in execution at the same time. In addition to these 
changes to the pipeline structure, we will also change the functional unit design 
by varying the number of units, the latency of operations, and the functional unit 
pipelining, so as to better explore these more advanced pipelining techniques. 


Dynamic Scheduling with a Scoreboard 


In a dynamically scheduled pipeline, all instructions pass through the issue stage 
in order (in-order issue); however, they can be stalled or bypass each other in the 
second stage (read operands) and thus enter execution out of order. Scoreboard- 
ing is a technique for allowing instructions to execute out of order when there are 
sufficient resources and no data dependences; it is named after the CDC 6600 
scoreboard, which developed this capability. 

Before we see how scoreboarding could be used in the MIPS pipeline, it is 
important to observe that WAR hazards, which did not exist in the MIPS floating- 
point or integer pipelines, may arise when instructions execute out of order. For 
example, consider the following code sequence: 


    DIV.D   F0,F2,F4
    ADD.D   F10,F0,F8
    SUB.D   F8,F8,F14


There is an antidependence between the ADD.D and the SUB.D: If the pipeline
executes the SUB.D before the ADD.D, it will violate the antidependence, yielding
incorrect execution. Likewise, to avoid violating output dependences, WAW
hazards (e.g., as would occur if the destination of the SUB.D were F10) must also be
detected. As we will see, both these hazards are avoided in a scoreboard by
stalling the later instruction involved in the antidependence.
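
The three kinds of hazard in this sequence can be identified mechanically from the register names. The Python sketch below is illustrative only: the `Instr` tuple and the `hazards` function are assumptions made for the example, not part of any real tool.

```python
from collections import namedtuple

# op: mnemonic, dest: destination register, srcs: source registers
Instr = namedtuple("Instr", "op dest srcs")


def hazards(earlier, later):
    """Hazards that arise if `later` is reordered around `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # later writes what earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


div = Instr("DIV.D", "F0", ("F2", "F4"))
add = Instr("ADD.D", "F10", ("F0", "F8"))
sub = Instr("SUB.D", "F8", ("F8", "F14"))

print(hazards(div, add))  # {'RAW'}
print(hazards(add, sub))  # {'WAR'}
```

Changing the destination of the SUB.D to F10, as the text suggests, turns the second result into a WAW hazard instead.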

The goal of a scoreboard is to maintain an execution rate of one instruction 
per clock cycle (when there are no structural hazards) by executing an instruction 
as early as possible. Thus, when the next instruction to execute is stalled, other 
instructions can be issued and executed if they do not depend on any active or 
stalled instruction. The scoreboard takes full responsibility for instruction issue 
and execution, including all hazard detection. Taking advantage of out-of-order 
execution requires multiple instructions to be in their EX stage simultaneously. 
This can be achieved with multiple functional units, with pipelined functional 
units, or with both. Since these two capabilities—pipelined functional units and 
multiple functional units—are essentially equivalent for the purposes of pipeline 
control, we will assume the processor has multiple functional units. 

The CDC 6600 had 16 separate functional units, including 4 floating-point 
units, 5 units for memory references, and 7 units for integer operations. On a 
processor for the MIPS architecture, scoreboards make sense primarily on the
floating-point unit since the latency of the other functional units is very small. 
Let's assume that there are two multipliers, one adder, one divide unit, and a sin- 
gle integer unit for all memory references, branches, and integer operations. 
Although this example is simpler than the CDC 6600, it is sufficiently powerful 
to demonstrate the principles without having a mass of detail or needing very 
long examples. Because both MIPS and the CDC 6600 are load-store architec- 
tures, the techniques are nearly identical for the two processors. Figure A.50 
shows what the processor looks like. 

Every instruction goes through the scoreboard, where a record of the data 
dependences is constructed; this step corresponds to instruction issue and replaces 
part of the ID step in the MIPS pipeline. The scoreboard then determines when 
the instruction can read its operands and begin execution. If the scoreboard 
decides the instruction cannot execute immediately, it monitors every change in 
the hardware and decides when the instruction can execute. The scoreboard also 
controls when an instruction can write its result into the destination register. 
Thus, all hazard detection and resolution is centralized in the scoreboard. We will
see a picture of the scoreboard later (Figure A.51 on page A-71), but first we
need to understand the steps in the issue and execution segment of the pipeline.

Figure A.50 The basic structure of a MIPS processor with a scoreboard. The
scoreboard's function is to control instruction execution (vertical control lines). All data flows
between the register file and the functional units over the buses (the horizontal lines,
called trunks in the CDC 6600). There are two FP multipliers, an FP divider, an FP adder,
and an integer unit. One set of buses (two inputs and one output) serves a group of
functional units. The details of the scoreboard are shown in Figures A.51-A.54.


Each instruction undergoes four steps in executing. (Since we are concen- 


trating on the FP operations, we will not consider a step for memory access.) 
Let's first examine the steps informally and then look in detail at how the score- 
board keeps the necessary information that determines when to progress from 
one step to the next. The four steps, which replace the ID, EX, and WB steps in 
the standard MIPS pipeline, are as follows: 


1. 


2: 


Issue—If a functional unit for the instruction is free and no other active 
instruction has the same destination register, the scoreboard issues the 
instruction to the functional unit and updates its internal data structure. This 
step replaces a portion of the ID step in the MIPS pipeline. By ensuring that 
no other active functional unit wants to write its result into the destination 
register, we guarantee that WAW hazards cannot be present. If a structural or 
WAW hazard exists, then the instruction issue stalls, and no further instruc- 
tions will issue until these hazards are cleared. When the issue stage stalls, it 
causes the buffer between instruction fetch and issue to fill; if the buffer is a 
single entry, instruction fetch stalls immediately. If the buffer is a queue with 
multiple instructions, it stalls when the queue fills. 


Read operands—The scoreboard monitors the availability of the source oper- 
ands. A source operand is available if no earlier issued active instruction is 
going to write it. When the source operands are available, the scoreboard tells 
the functional unit to proceed to read the operands from the registers and 
begin execution. The scoreboard resolves RAW hazards dynamically in this 
step, and instructions may be sent into execution out of order. This step, 
together with issue, completes the function of the ID step in the simple MIPS 
pipeline. 

3. Execution: The functional unit begins execution upon receiving operands.
When the result is ready, it notifies the scoreboard that it has completed 
execution. This step replaces the EX step in the MIPS pipeline and takes mul- 
tiple cycles in the MIPS FP pipeline. 


4. Write result: Once the scoreboard is aware that the functional unit has completed
execution, the scoreboard checks for WAR hazards and stalls the completing
instruction, if necessary.


A WAR hazard exists if there is a code sequence like our earlier example
with ADD.D and SUB.D that both use F8. In that example we had the code

    DIV.D  F0,F2,F4
    ADD.D  F10,F0,F8
    SUB.D  F8,F8,F14

ADD.D has a source operand F8, which is the same register as the destination
of SUB.D. But ADD.D actually depends on an earlier instruction. The scoreboard
will still stall the SUB.D in its Write Result stage until ADD.D reads its
operands. In general, then, a completing instruction cannot be allowed to
write its results when


• there is an instruction that has not read its operands that precedes (i.e., in
  order of issue) the completing instruction, and

• one of the operands is the same register as the result of the completing
  instruction.


If this WAR hazard does not exist, or when it clears, the scoreboard tells the 
functional unit to store its result to the destination register. This step 
replaces the WB step in the simple MIPS pipeline. 


At first glance, it might appear that the scoreboard will have difficulty sepa- 
rating RAW and WAR hazards. 

Because the operands for an instruction are read only when both operands are
available in the register file, this scoreboard does not take advantage of
forwarding. This is not as large a penalty as you might initially think. Unlike
our earlier simple pipeline, instructions write their results into the register
file as soon as they complete execution (assuming no WAR hazards), rather than
waiting for a statically assigned write slot that may be several cycles away.
The effect is reduced pipeline latency, which recovers some of the benefit of
forwarding. There is still one additional cycle of latency, since the write
result and read operand stages cannot overlap. We would need additional
buffering to eliminate this overhead.

Based on its own data structure, the scoreboard controls the instruction pro- 
gression from one step to the next by communicating with the functional units. 
There is a small complication, however. There are only a limited number of 
source operand buses and result buses to the register file, which represents a 
structural hazard. The scoreboard must guarantee that the number of functional
units allowed to proceed into steps 2 and 4 does not exceed the number of buses
available. We will not go into further detail on this, other than to mention that the 
CDC 6600 solved this problem by grouping the 16 functional units together into 
four groups and supplying a set of buses, called data trunks, for each group. Only 
one unit in a group could read its operands or write its result during a clock. 

Now let's look at the detailed data structure maintained by a MIPS scoreboard
with five functional units. Figure A.51 shows what the scoreboard's information
looks like partway through the execution of this simple sequence of
instructions:

    L.D    F6,34(R2)
    L.D    F2,45(R3)
    MUL.D  F0,F2,F4
    SUB.D  F8,F6,F2
    DIV.D  F10,F0,F6
    ADD.D  F6,F8,F2

Instruction status

Instruction           Issue   Read operands   Execution complete   Write result
L.D    F6,34(R2)        √           √                 √                 √
L.D    F2,45(R3)        √           √                 √
MUL.D  F0,F2,F4         √
SUB.D  F8,F6,F2         √
DIV.D  F10,F0,F6        √
ADD.D  F6,F8,F2

Functional unit status

Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
Integer   Yes    Load   F2    R3                            No
Mult1     Yes    Mult   F0    F2   F4   Integer             No    Yes
Mult2     No
Add       Yes    Sub    F8    F6   F2             Integer   Yes   No
Divide    Yes    Div    F10   F0   F6   Mult1               No    Yes

Register result status

      F0      F2        F4   F6    F8    F10      F12   ...   F30
FU    Mult1   Integer              Add   Divide


Figure A.51 Components of the scoreboard. Each instruction that has issued or is pending issue has an entry in
the instruction status table. There is one entry in the functional unit status table for each functional unit. Once an
instruction issues, the record of its operands is kept in the functional unit status table. Finally, the register result table
indicates which unit will produce each pending result; the number of entries is equal to the number of registers. The
instruction status table says that (1) the first L.D has completed and written its result, and (2) the second L.D has
completed execution but has not yet written its result. The MUL.D, SUB.D, and DIV.D have all issued but are stalled,
waiting for their operands. The functional unit status says that the first multiply unit is waiting for the integer unit,
the add unit is waiting for the integer unit, and the divide unit is waiting for the first multiply unit. The ADD.D
instruction is stalled because of a structural hazard; it will clear when the SUB.D completes. If an entry in one of these
scoreboard tables is not being used, it is left blank. For example, the Rk field is not used on a load and the Mult2 unit is
unused, hence their fields have no meaning. Also, once an operand has been read, the Rj and Rk fields are set to No.
Figure A.54 shows why this last step is crucial.


There are three parts to the scoreboard: 


1. Instruction status: Indicates which of the four steps the instruction is in.

2. Functional unit status: Indicates the state of the functional unit (FU). There
   are nine fields for each functional unit:

   • Busy: Indicates whether the unit is busy or not.
   • Op: Operation to perform in the unit (e.g., add or subtract).
   • Fi: Destination register.
   • Fj, Fk: Source-register numbers.
   • Qj, Qk: Functional units producing source registers Fj, Fk.
   • Rj, Rk: Flags indicating when Fj, Fk are ready and not yet read. Set to No
     after operands are read.

3. Register result status: Indicates which functional unit will write each register,
   if an active instruction has the register as its destination. This field is set to
   blank whenever there are no pending instructions that will write that register.
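
One way to picture the functional unit status and register result status tables is as plain records. The following is a minimal sketch in Python, not the book's notation: the class name `FunctionalUnitStatus`, the helper `waiting_on`, and the use of `None` for a blank field are all illustrative assumptions. The values are the ones shown in Figure A.51.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FunctionalUnitStatus:
    """One row of the functional unit status table (the nine fields)."""
    name: str
    busy: bool = False
    op: Optional[str] = None    # operation, e.g. "Mult"
    fi: Optional[str] = None    # destination register
    fj: Optional[str] = None    # first source register
    fk: Optional[str] = None    # second source register
    qj: Optional[str] = None    # unit producing Fj (None: blank field)
    qk: Optional[str] = None    # unit producing Fk (None: blank field)
    rj: Optional[bool] = None   # Fj ready and not yet read
    rk: Optional[bool] = None   # Fk ready and not yet read


# Functional unit status as shown in Figure A.51.
units: Dict[str, FunctionalUnitStatus] = {
    u.name: u
    for u in [
        FunctionalUnitStatus("Integer", True, "Load", "F2", "R3", rj=False),
        FunctionalUnitStatus("Mult1", True, "Mult", "F0", "F2", "F4",
                             qj="Integer", rj=False, rk=True),
        FunctionalUnitStatus("Mult2"),
        FunctionalUnitStatus("Add", True, "Sub", "F8", "F6", "F2",
                             qk="Integer", rj=True, rk=False),
        FunctionalUnitStatus("Divide", True, "Div", "F10", "F0", "F6",
                             qj="Mult1", rj=False, rk=True),
    ]
}

# Register result status: register name -> unit that will write it.
register_result: Dict[str, str] = {
    "F0": "Mult1", "F2": "Integer", "F8": "Add", "F10": "Divide",
}


def waiting_on(unit: FunctionalUnitStatus) -> List[str]:
    """Return the units whose results this unit still needs."""
    return [q for q in (unit.qj, unit.qk) if q is not None]


for u in units.values():
    if u.busy and waiting_on(u):
        print(f"{u.name} waits for {', '.join(waiting_on(u))}")
```

Reading the Q fields this way reproduces the waiting relationships stated in the caption of Figure A.51: the first multiply unit and the add unit wait for the integer unit, and the divide unit waits for the first multiply unit.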


Now let's look at how the code sequence begun in Figure A.51 continues exe- 
cution. After that, we will be able to examine in detail the conditions that the 
scoreboard uses to control execution. 


Example Assume the following EX cycle latencies (chosen to illustrate the behavior and
not representative) for the floating-point functional units: Add is 2 clock cycles,
multiply is 10 clock cycles, and divide is 40 clock cycles. Using the code segment
in Figure A.51 and beginning with the point indicated by the instruction status
in Figure A.51, show what the status tables look like when MUL.D and DIV.D
are each ready to go to the Write Result state.


Answer There are RAW data hazards from the second L.D to MUL.D, ADD.D, and SUB.D,
from MUL.D to DIV.D, and from SUB.D to ADD.D. There is a WAR data hazard
between DIV.D and ADD.D and SUB.D. Finally, there is a structural hazard on the
add functional unit for ADD.D and SUB.D. What the tables look like when MUL.D
and DIV.D are ready to write their results is shown in Figures A.52 and A.53,
respectively.
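
The data hazards named in the answer can be enumerated mechanically from the register names. The sketch below is an illustration only: the tuple encoding of the program and the function `find_hazards` are assumptions made for this example, and the check is a simple pairwise comparison of register names. It therefore also reports pairs involving the first L.D, which the answer omits because that load has already written its result at the point shown in Figure A.51.

```python
from typing import List, Tuple

# (mnemonic, destination register, source registers). Load offsets are
# omitted because only register names matter for hazard detection.
PROGRAM = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F6", "F2"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]


def find_hazards(program) -> List[Tuple[str, int, int, str]]:
    """Return (kind, earlier index, later index, register) for each pair."""
    hazards = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:       # later instruction reads earlier result
                hazards.append(("RAW", i, j, dest_i))
            if dest_j in srcs_i:       # later instruction overwrites a source
                hazards.append(("WAR", i, j, dest_j))
            if dest_i == dest_j:       # both write the same register
                hazards.append(("WAW", i, j, dest_i))
    return hazards


for kind, i, j, reg in find_hazards(PROGRAM):
    print(f"{kind} on {reg}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j})")
```

The structural hazard on the add unit is not a register-name conflict, so it does not appear in this listing.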

Now we can see how the scoreboard works in detail by looking at what has to 
happen for the scoreboard to allow each instruction to proceed. Figure A.54 
shows what the scoreboard requires for each instruction to advance and the book- 
keeping action necessary when the instruction does advance. The scoreboard 
records operand specifier information, such as register numbers. For example, we 
must record the source registers when an instruction is issued. Because we refer 
to the contents of a register as Regs[D], where D is a register name, there is no 
ambiguity. For example, Fj[FU] ← S1 causes the register name S1 to be placed in
Fj[FU], rather than the contents of register S1.
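
The checks and bookkeeping actions of Figure A.54 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CDC 6600 logic: the class `Scoreboard`, its method names, and the dictionary layout are invented for this example, bus limits and execution latencies are not modeled, and a blank field is represented by `None`.

```python
from typing import Dict, Optional

UNITS = ["Integer", "Mult1", "Mult2", "Add", "Divide"]


def empty_unit() -> dict:
    return {"busy": False, "op": None, "fi": None, "fj": None, "fk": None,
            "qj": None, "qk": None, "rj": False, "rk": False}


class Scoreboard:
    def __init__(self) -> None:
        self.fu: Dict[str, dict] = {name: empty_unit() for name in UNITS}
        self.result: Dict[str, Optional[str]] = {}  # register -> producing unit

    def can_issue(self, unit: str, dest: str) -> bool:
        # Wait until: not Busy[FU] and not Result[D] (structural and WAW check)
        return not self.fu[unit]["busy"] and self.result.get(dest) is None

    def issue(self, unit: str, op: str, dest: str,
              s1: str, s2: Optional[str] = None) -> None:
        qj = self.result.get(s1)
        qk = self.result.get(s2) if s2 is not None else None
        # Rj <- not Qj; Rk <- not Qk: ready when no unit will produce it
        self.fu[unit] = {"busy": True, "op": op, "fi": dest, "fj": s1,
                         "fk": s2, "qj": qj, "qk": qk,
                         "rj": qj is None, "rk": qk is None}
        self.result[dest] = unit

    def can_read(self, unit: str) -> bool:
        # Wait until: Rj and Rk (RAW check)
        return self.fu[unit]["rj"] and self.fu[unit]["rk"]

    def read_operands(self, unit: str) -> None:
        self.fu[unit].update(rj=False, rk=False, qj=None, qk=None)

    def can_write(self, unit: str) -> bool:
        # WAR check: no unit still has to read our destination register
        fi = self.fu[unit]["fi"]
        return all((f["fj"] != fi or not f["rj"]) and
                   (f["fk"] != fi or not f["rk"])
                   for f in self.fu.values())

    def write_result(self, unit: str) -> None:
        for f in self.fu.values():      # wake up units waiting on this one
            if f["qj"] == unit:
                f["rj"] = True
            if f["qk"] == unit:
                f["rk"] = True
        self.result[self.fu[unit]["fi"]] = None
        self.fu[unit] = empty_unit()    # Busy[FU] <- No


sb = Scoreboard()
sb.issue("Mult1", "Mult", "F0", "F2", "F4")    # MUL.D F0,F2,F4
sb.issue("Divide", "Div", "F10", "F0", "F6")   # DIV.D F10,F0,F6
sb.issue("Add", "Add", "F6", "F8", "F2")       # ADD.D F6,F8,F2
sb.read_operands("Mult1")
sb.read_operands("Add")
print(sb.can_write("Add"))    # False: DIV.D has not read F6 yet (WAR)
sb.write_result("Mult1")      # F0 becomes available to DIV.D
sb.read_operands("Divide")
print(sb.can_write("Add"))    # True: the WAR hazard has cleared
```

The short run at the end mirrors the situation in Figures A.52 and A.53: the ADD.D is held in write result until the DIV.D has read F6.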

Instruction status

Instruction           Issue   Read operands   Execution complete   Write result
L.D    F6,34(R2)        √           √                 √                 √
L.D    F2,45(R3)        √           √                 √                 √
MUL.D  F0,F2,F4         √           √                 √
SUB.D  F8,F6,F2         √           √                 √                 √
DIV.D  F10,F0,F6        √
ADD.D  F6,F8,F2         √           √                 √

Functional unit status

Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
Integer   No
Mult1     Yes    Mult   F0    F2   F4                       No    No
Mult2     No
Add       Yes    Add    F6    F8   F2                       No    No
Divide    Yes    Div    F10   F0   F6   Mult1               No    Yes

Register result status

      F0      F2        F4   F6    F8    F10      F12   ...   F30
FU    Mult1                  Add         Divide

Figure A.52 Scoreboard tables just before the MUL.D goes to write result. The DIV.D has not yet read either of its
operands, since it has a dependence on the result of the multiply. The ADD.D has read its operands and is in
execution, although it was forced to wait until the SUB.D finished to get the functional unit. ADD.D cannot proceed to write
result because of the WAR hazard on F6, which is used by the DIV.D. The Q fields are only relevant when a functional
unit is waiting for another unit.


The costs and benefits of scoreboarding are interesting considerations. The
CDC 6600 designers measured a performance improvement of 1.7 for FORTRAN
programs and 2.5 for hand-coded assembly language. However, this was
measured in the days before software pipeline scheduling, semiconductor main
memory, and caches (which lower memory access time). The scoreboard on the
CDC 6600 had about as much logic as one of the functional units, which is
surprisingly low. The main cost was in the large number of buses, about four times
as many as would be required if the CPU only executed instructions in order (or 
if it only initiated one instruction per execute cycle). The recently increasing 
interest in dynamic scheduling is motivated by attempts to issue more instruc- 
tions per clock (so the cost of more buses must be paid anyway) and by ideas like 
speculation (explored in Section 4.7) that naturally build on dynamic scheduling. 

A scoreboard uses the available ILP to minimize the number of stalls arising 
from the program's true data dependences. In eliminating stalls, a scoreboard is 
limited by several factors: 

Instruction status

Instruction           Issue   Read operands   Execution complete   Write result
L.D    F6,34(R2)        √           √                 √                 √
L.D    F2,45(R3)        √           √                 √                 √
MUL.D  F0,F2,F4         √           √                 √                 √
SUB.D  F8,F6,F2         √           √                 √                 √
DIV.D  F10,F0,F6        √           √                 √
ADD.D  F6,F8,F2         √           √                 √                 √

Functional unit status

Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
Integer   No
Mult1     No
Mult2     No
Add       No
Divide    Yes    Div    F10   F0   F6                       No    No

Register result status

      F0      F2        F4   F6    F8    F10      F12   ...   F30
FU                                       Divide

Figure A.53 Scoreboard tables just before the DIV.D goes to write result. ADD.D was able to complete as soon as
DIV.D passed through read operands and got a copy of F6. Only the DIV.D remains to finish.


1. The amount of parallelism available among the instructions: This determines
whether independent instructions can be found to execute. If each
instruction depends on its predecessor, no dynamic scheduling scheme can 
reduce stalls. If the instructions in the pipeline simultaneously must be cho- 
sen from the same basic block (as was true in the 6600), this limit is likely to 
be quite severe. 


2. The number of scoreboard entries: This determines how far ahead the pipeline
can look for independent instructions. The set of instructions examined
as candidates for potential execution is called the window. The size of the 
scoreboard determines the size of the window. In this section, we assume a 
window does not extend beyond a branch, so the window (and the score- 
board) always contains straight-line code from a single basic block. Chapter 2 
shows how the window can be extended beyond a branch. 

Instruction status    Wait until                             Bookkeeping

Issue                 Not Busy[FU] and not Result[D]         Busy[FU]←yes; Op[FU]←op; Fi[FU]←D;
                                                             Fj[FU]←S1; Fk[FU]←S2;
                                                             Qj←Result[S1]; Qk←Result[S2];
                                                             Rj←not Qj; Rk←not Qk; Result[D]←FU;

Read operands         Rj and Rk                              Rj←No; Rk←No; Qj←0; Qk←0

Execution complete    Functional unit done

Write result          ∀f((Fj[f] ≠ Fi[FU] or Rj[f] = No) &    ∀f(if Qj[f]=FU then Rj[f]←Yes);
                      (Fk[f] ≠ Fi[FU] or Rk[f] = No))        ∀f(if Qk[f]=FU then Rk[f]←Yes);
                                                             Result[Fi[FU]]←0; Busy[FU]←No

Figure A.54 Required checks and bookkeeping actions for each step in instruction execution. FU stands for the
functional unit used by the instruction, D is the destination register name, S1 and S2 are the source register names,
and op is the operation to be done. To access the scoreboard entry named Fj for functional unit FU we use the
notation Fj[FU]. Result[D] is the name of the functional unit that will write register D. The test on the write result case
prevents the write when there is a WAR hazard, which exists if another instruction has this instruction's destination
(Fi[FU]) as a source (Fj[f] or Fk[f]) and if some other instruction has written the register (Rj = Yes or Rk = Yes). The
variable f is used for any functional unit.


3. The number and types of functional units: This determines the importance of
structural hazards, which can increase when dynamic scheduling is used.


4. The presence of antidependences and output dependences: These lead to
WAR and WAW stalls.


Chapters 2 and 3 focus on techniques that attack the problem of exposing and
better utilizing available ILP. The second and third factors can be attacked by
increasing the size of the scoreboard and the number of functional units;
however, these changes have cost implications and may also affect cycle time. WAW
and WAR hazards become more important in dynamically scheduled processors
because the pipeline exposes more name dependences. WAW hazards also
become more important if we use dynamic scheduling with a branch-prediction
scheme that allows multiple iterations of a loop to overlap.


A.8 Fallacies and Pitfalls


Pitfall Unexpected execution sequences may cause unexpected hazards.


At first glance, WAW hazards look like they should never occur in a code 
sequence because no compiler would ever generate two writes to the same regis- 
ter without an intervening read. But they can occur when the sequence is unex- 
pected. For example, the first write might be in the delay slot of a taken branch 
when the scheduler thought the branch would not be taken. Here is the code 
sequence that could cause this: 

        BNEZ   R1,foo
        DIV.D  F0,F2,F4   ; moved into delay slot
                          ; from fall through

foo:    L.D    F0,qrs


If the branch is taken, then before the DIV.D can complete, the L.D will reach
WB, causing a WAW hazard. The hardware must detect this and may stall the 
issue of the L.D. Another way this can happen is if the second write is in a trap 
routine. This occurs when an instruction that traps and is writing results contin- 
ues and completes after an instruction that writes the same register in the trap 
handler. The hardware must detect and prevent this as well. 


Pitfall Extensive pipelining can impact other aspects of a design, leading to overall worse
cost-performance.


The best example of this phenomenon comes from two implementations of the 
VAX, the 8600 and the 8700. When the 8600 was initially delivered, it had a 
cycle time of 80 ns. Subsequently, a redesigned version, called the 8650, with a 
55 ns clock was introduced. The 8700 has a much simpler pipeline that operates 
at the microinstruction level, yielding a smaller CPU with a faster clock cycle of 
45 ns. The overall outcome is that the 8650 has a CPI advantage of about 20%, 
but the 8700 has a clock rate that is about 20% faster. Thus, the 8700 achieves the 
same performance with much less hardware. 
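
As a rough check on these figures, the comparison can be written with the CPU time formula (CPU time = instruction count × CPI × clock cycle time). The sketch below assumes that both machines execute the same VAX instruction count (IC) and reads "a CPI advantage of about 20%" as CPI of the 8700 ≈ 1.2 × CPI of the 8650. Both are interpretive assumptions, not measurements from the text.

```latex
\[
\frac{\text{CPU time}_{8650}}{\text{CPU time}_{8700}}
  = \frac{\text{IC} \times \text{CPI}_{8650} \times 55\ \text{ns}}
         {\text{IC} \times \text{CPI}_{8700} \times 45\ \text{ns}}
  \approx \frac{1}{1.2} \times \frac{55}{45}
  \approx 1.02
\]
```

Under these assumptions the two machines land within a few percent of each other, which is consistent with the statement that the 8700 achieves about the same performance.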


Pitfall Evaluating dynamic or static scheduling on the basis of unoptimized code. 


Unoptimized code—containing redundant loads, stores, and other operations that 
might be eliminated by an optimizer—is much easier to schedule than "tight" 
optimized code. This holds for scheduling both control delays (with delayed 
branches) and delays arising from RAW hazards. In gcc running on an R3000, 
which has a pipeline almost identical to that of Section A.1, the frequency of idle
clock cycles increases by 18% from the unoptimized and scheduled code to the 
optimized and scheduled code. Of course, the optimized program is much faster, 
since it has fewer instructions. To fairly evaluate a compile time scheduler or run 
time dynamic scheduling, you must use optimized code, since in the real system 
you will derive good performance from other optimizations in addition to sched- 
uling. 


Concluding Remarks 


At the beginning of the 1980s, pipelining was a technique reserved primarily for 
supercomputers and large multimillion dollar mainframes. By the mid-1980s, the 
first pipelined microprocessors appeared and helped transform the world of com- 
puting, allowing microprocessors to bypass minicomputers in performance and 
eventually to take on and outperform mainframes. By the early 1990s, high-end 
embedded microprocessors embraced pipelining, and desktops were headed
toward the use of the sophisticated dynamically scheduled, multiple-issue 
approaches discussed in Chapter 2. The material in this appendix, which was 
considered reasonably advanced for graduate students when this text first 
appeared in 1990, is now considered basic undergraduate material and can be 
found in processors costing less than $10! 


Historical Perspective and References 


Section K.4 on the companion CD features a discussion on the development of 
pipelining and instruction-level parallelism. We provide numerous references for 
further reading and exploration of these topics. 

A n    Add the number in storage location n into the accumulator.

E n    If the number in the accumulator is greater than or equal to
       zero execute next the order which stands in storage location n;
       otherwise proceed serially.

Z      Stop the machine and ring the warning bell.


Wilkes and Renwick
Selection from the List of 18 Machine
Instructions for the EDSAC (1949)
Introduction


In this appendix we concentrate on instruction set architecture—the portion of 
the computer visible to the programmer or compiler writer. Most of this material 
should be review for readers of this book; we include it here for background. This 
appendix introduces the wide variety of design alternatives available to the 
instruction set architect. In particular, we focus on four topics. First, we present a 
taxonomy of instruction set alternatives and give some qualitative assessment of 
the advantages and disadvantages of various approaches. Second, we present and 
analyze some instruction set measurements that are largely independent of a spe- 
cific instruction set. Third, we address the issue of languages and compilers and 
their bearing on instruction set architecture. Finally, the "Putting It All Together" 
section shows how these ideas are reflected in the MIPS instruction set, which is 
typical of RISC architectures. We conclude with fallacies and pitfalls of instruc- 
tion set design. 

To illustrate the principles further, Appendix J also gives four examples of 
general-purpose RISC architectures (MIPS, PowerPC, Precision Architecture, 
SPARC), four embedded RISC processors (ARM, Hitachi SH, MIPS16,
Thumb), and three older architectures (80x86, IBM 360/370, and VAX). Before 
we discuss how to classify architectures, we need to say something about instruc- 
tion set measurement. 

Throughout this appendix, we examine a wide variety of architectural mea- 
surements. Clearly, these measurements depend on the programs measured and 
on the compilers used in making the measurements. The results should not be 
interpreted as absolute, and you might see different data if you did the measure- 
ment with a different compiler or a different set of programs. We believe that the 
measurements in this appendix are reasonably indicative of a class of typical 
applications. Many of the measurements are presented using a small set of bench- 
marks, so that the data can be reasonably displayed and the differences among 
programs can be seen. An architect for a new computer would want to analyze a 
much larger collection of programs before making architectural decisions. The 
measurements shown are usually dynamic; that is, the frequency of a measured
event is weighted by the number of times that event occurs during execution of the
measured program.
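
The difference between a static and a dynamic measurement can be illustrated with a small count. The sketch below uses a made-up six-instruction program (three setup instructions executed once and a three-instruction loop body executed 100 times). The program, the opcode names, and the execution counts are hypothetical, chosen only to show how the two frequencies diverge.

```python
from collections import Counter

# Hypothetical program: (opcode, number of times that instruction executes).
program = [
    ("load", 1), ("load", 1), ("store", 1),         # setup, executed once
    ("load", 100), ("add", 100), ("branch", 100),   # loop body
]

# Static count: each instruction in the program text counts once.
static = Counter(op for op, _ in program)

# Dynamic count: each instruction is weighted by how often it executes.
dynamic = Counter()
for op, count in program:
    dynamic[op] += count


def frequencies(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


print("static: ", frequencies(static))
print("dynamic:", frequencies(dynamic))
```

In this example loads make up half of the static instruction mix but only about a third of the dynamic mix, because the loop body dominates execution.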

Before starting with the general principles, let's review the three application 
areas from Chapter 1. Desktop computing emphasizes performance of programs 
with integer and floating-point data types, with little regard for program size or 
processor power consumption. For example, code size has never been reported in 
the five generations of SPEC benchmarks. Servers today are used primarily for 
database, file server, and Web applications, plus some time-sharing applications 
for many users. Hence, floating-point performance is much less important than
integer and character-string performance, yet virtually every server processor
still includes floating-point instructions. Embedded applications value cost
and power, so code size is important because less memory is both cheaper and 
lower power, and some classes of instructions (such as floating point) may be 
optional to reduce chip costs. 




Thus, instruction sets for all three applications are very similar. In fact, the 
MIPS architecture that drives this appendix has been used successfully in desk- 
tops, servers, and embedded applications. 

One successful architecture very different from RISC is the 80x86 (see 
Appendix J). Surprisingly, its success does not necessarily belie the advantages 
of a RISC instruction set. The commercial importance of binary compatibility 
with PC software combined with the abundance of transistors provided by 
Moore's Law led Intel to use a RISC instruction set internally while supporting 
an 80x86 instruction set externally. Recent 80x86 microprocessors, such as the 
Pentium 4, use hardware to translate from 80x86 instructions to RISC-like 
instructions and then execute the translated operations inside the chip. They 
maintain the illusion of 80x86 architecture to the programmer while allowing the 
computer designer to implement a RISC-style processor for performance. 

Now that the background is set, we begin by exploring how instruction set 
architectures can be classified. 


Classifying Instruction Set Architectures 


The type of internal storage in a processor is the most basic differentiation, so in 
this section we will focus on the alternatives for this portion of the architecture. 
The major choices are a stack, an accumulator, or a set of registers. Operands 
may be named explicitly or implicitly: The operands in a stack architecture are 
implicitly on the top of the stack, and in an accumulator architecture one operand 
is implicitly the accumulator. The general-purpose register architectures have 
only explicit operands—either registers or memory locations. Figure B.1 shows a 
block diagram of such architectures, and Figure B.2 shows how the code 
sequence C = A + B would typically appear in these three classes of instruction 
sets. The explicit operands may be accessed directly from memory or may need 
to be first loaded into temporary storage, depending on the class of architecture 
and choice of specific instruction. 

As the figures show, there are really two classes of register computers. One 
class can access memory as part of any instruction, called register-memory archi- 
tecture, and the other can access memory only with load and store instructions, 
called load-store architecture. A third class, not found in computers shipping 
today, keeps all operands in memory and is called a memory-memory architec- 
ture. Some instruction set architectures have more registers than a single accumu- 
lator, but place restrictions on uses of these special registers. Such an architecture 
is sometimes called an extended accumulator or special-purpose register com- 
puter. 

Although most early computers used stack or accumulator-style archi- 
tectures, virtually every new architecture designed after 1980 uses a load-store 
register architecture. The major reasons for the emergence of general-purpose 
register (GPR) computers are twofold. First, registers—like other forms of stor- 
age internal to the processor—are faster than memory. Second, registers are more 

Figure B.1 Operand locations for four instruction set architecture classes. The arrows indicate whether the
operand is an input or the result of the ALU operation, or both an input and result. Lighter shades indicate inputs, and the
dark shade indicates the result. In (a), a Top Of Stack register (TOS) points to the top input operand, which is
combined with the operand below. The first operand is removed from the stack, the result takes the place of the second
operand, and TOS is updated to point to the result. All operands are implicit. In (b), the Accumulator is both an
implicit input operand and a result. In (c), one input operand is a register, one is in memory, and the result goes to a
register. All operands are registers in (d) and, like the stack architecture, can be transferred to memory only via
separate instructions: push or pop for (a) and load or store for (d).

















                             Register               Register
Stack       Accumulator      (register-memory)      (load-store)
Push A      Load A           Load R1,A              Load R1,A
Push B      Add B            Add R3,R1,B            Load R2,B
Add         Store C          Store R3,C             Add R3,R1,R2
Pop C                                               Store R3,C

Figure B.2 The code sequence for C = A + B for four classes of instruction sets. Note
that the Add instruction has implicit operands for stack and accumulator architectures,
and explicit operands for register architectures. It is assumed that A, B, and C all belong
in memory and that the values of A and B cannot be destroyed. Figure B.1 shows the
Add operation for each class of architecture.
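
To make the four code sequences of Figure B.2 concrete, the sketch below runs each of them on a toy interpreter and checks that C ends up holding A + B with A and B unchanged. The interpreters, the tuple encoding of instructions, and the function names are illustrative assumptions; they model only the three or four operations that appear in the figure.

```python
def run_stack(code, mem):
    """Stack machine: operands of Add are implicitly on top of the stack."""
    stack = []
    for op, *args in code:
        if op == "Push":
            stack.append(mem[args[0]])
        elif op == "Add":
            stack.append(stack.pop() + stack.pop())
        elif op == "Pop":
            mem[args[0]] = stack.pop()
    return mem


def run_accumulator(code, mem):
    """Accumulator machine: one operand is implicitly the accumulator."""
    acc = 0
    for op, name in code:
        if op == "Load":
            acc = mem[name]
        elif op == "Add":
            acc += mem[name]
        elif op == "Store":
            mem[name] = acc
    return mem


def run_register(code, mem):
    """Register machine. An Add source may be a register or, for the
    register-memory class, a memory location."""
    regs = {}

    def value(x):
        return regs[x] if x in regs else mem[x]

    for op, *args in code:
        if op == "Load":
            regs[args[0]] = mem[args[1]]
        elif op == "Add":
            regs[args[0]] = value(args[1]) + value(args[2])
        elif op == "Store":
            mem[args[1]] = regs[args[0]]
    return mem


STACK = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
ACCUMULATOR = [("Load", "A"), ("Add", "B"), ("Store", "C")]
REGISTER_MEMORY = [("Load", "R1", "A"), ("Add", "R3", "R1", "B"),
                   ("Store", "R3", "C")]
LOAD_STORE = [("Load", "R1", "A"), ("Load", "R2", "B"),
              ("Add", "R3", "R1", "R2"), ("Store", "R3", "C")]

for run, code in [(run_stack, STACK), (run_accumulator, ACCUMULATOR),
                  (run_register, REGISTER_MEMORY), (run_register, LOAD_STORE)]:
    print(len(code), "instructions ->", run(code, {"A": 2, "B": 3}))
```

The instruction counts printed alongside each result show the trade-off the figure illustrates: the load-store sequence needs the most instructions, while the accumulator and register-memory sequences need the fewest.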
efficient for a compiler to use than other forms of internal storage. For example,
on a register computer the expression (A*B) - (B*C) - (A*D) may be evaluated 
by doing the multiplications in any order, which may be more efficient because of 
the location of the operands or because of pipelining concerns (see Chapter 2). 
Nevertheless, on a stack computer the hardware must evaluate the expression in 
only one order, since operands are hidden on the stack, and it may have to load an 
operand multiple times. 

More importantly, registers can be used to hold variables. When variables are 
allocated to registers, the memory traffic reduces, the program speeds up (since 
registers are faster than memory), and the code density improves (since a register 
can be named with fewer bits than can a memory location). 

As explained in Section B.8, compiler writers would prefer that all registers 
be equivalent and unreserved. Older computers compromise this desire by dedi- 
cating registers to special uses, effectively decreasing the number of general- 
purpose registers. If the number of truly general-purpose registers is too small, 
trying to allocate variables to registers will not be profitable. Instead, the com- 
piler will reserve all the uncommitted registers for use in expression evaluation. 

How many registers are sufficient? The answer, of course, depends on the 
effectiveness of the compiler. Most compilers reserve some registers for expres- 
sion evaluation, use some for parameter passing, and allow the remainder to be 
allocated to hold variables. Modern compiler technology and its ability to
effectively use larger numbers of registers has led to an increase in register
counts in more recent architectures.

Two major instruction set characteristics divide GPR architectures. Both 
characteristics concern the nature of operands for a typical arithmetic or logical 
instruction (ALU instruction). The first concerns whether an ALU instruction has 
two or three operands. In the three-operand format, the instruction contains one 
result operand and two source operands. In the two-operand format, one of the 
operands is both a source and a result for the operation. The second distinction 
among GPR architectures concerns how many of the operands may be memory 
addresses in ALU instructions. The number of memory operands supported by a 
typical ALU instruction may vary from none to three. Figure B.3 shows combina- 
tions of these two attributes with examples of computers. Although there are 
seven possible combinations, three serve to classify nearly all existing computers. 
As we mentioned earlier, these three are load-store (also called register-register), 
register-memory, and memory-memory. 

Figure B.4 shows the advantages and disadvantages of each of these alterna- 
tives. Of course, these advantages and disadvantages are not absolutes: They are 
qualitative and their actual impact depends on the compiler and implementation 
strategy. A GPR computer with memory-memory operations could easily be 
ignored by the compiler and used as a load-store computer. One of the most per- 
vasive architectural impacts is on instruction encoding and the number of instruc- 
tions needed to perform a task. We see the impact of these architectural 
alternatives on implementation approaches in Appendix A and Chapter 2. 

Number of memory   Maximum number of
addresses          operands allowed    Type of architecture   Examples
0                  3                   Load-store             Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32
1                  2                   Register-memory        IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2                  2                   Memory-memory          VAX (also has three-operand formats)
3                  3                   Memory-memory          VAX (also has two-operand formats)

Figure B.3 Typical combinations of memory operands and total operands per typical ALU instruction with
examples of computers. Computers with no memory reference per ALU instruction are called load-store or
register-register computers. Instructions with multiple memory operands per typical ALU instruction are called
register-memory or memory-memory, according to whether they have one or more than one memory operand.








Type: Register-register (0, 3)
  Advantages:    Simple, fixed-length instruction encoding. Simple code generation model.
                 Instructions take similar numbers of clocks to execute (see App. A).
  Disadvantages: Higher instruction count than architectures with memory references in
                 instructions. More instructions and lower instruction density lead to
                 larger programs.

Type: Register-memory (1, 2)
  Advantages:    Data can be accessed without a separate load instruction first.
                 Instruction format tends to be easy to encode and yields good density.
  Disadvantages: Operands are not equivalent since a source operand in a binary operation
                 is destroyed. Encoding a register number and a memory address in each
                 instruction may restrict the number of registers. Clocks per instruction
                 vary by operand location.

Type: Memory-memory (2, 2) or (3, 3)
  Advantages:    Most compact. Doesn't waste registers for temporaries.
  Disadvantages: Large variation in instruction size, especially for three-operand
                 instructions. In addition, large variation in work per instruction.
                 Memory accesses create memory bottleneck. (Not used today.)





Figure B.4 Advantages and disadvantages of the three most common types of general-purpose register
computers. The notation (m, n) means m memory operands and n total operands. In general, computers
with fewer alternatives simplify the compiler's task since there are fewer decisions for the compiler
to make (see Section B.8). Computers with a wide variety of flexible instruction formats reduce the
number of bits required to encode the program. The number of registers also affects the instruction
size since you need log2(number of registers) bits for each register specifier in an instruction.
Thus, doubling the number of registers takes 3 extra bits for a register-register architecture, or
about 10% of a 32-bit instruction.
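
The register-count arithmetic in the caption can be checked with a short sketch. The function names, and the default of three register specifiers per instruction, are illustrative assumptions rather than anything defined in the text.

```python
def specifier_bits(num_registers: int) -> int:
    """Bits needed to name one of num_registers registers: ceil(log2(n))."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    return (num_registers - 1).bit_length()


def register_field_bits(num_registers: int, specifiers: int = 3) -> int:
    """Total bits spent on register specifiers in one instruction.

    Assumes a register-register format with `specifiers` register fields
    (two sources and one destination by default).
    """
    return specifiers * specifier_bits(num_registers)


extra = register_field_bits(64) - register_field_bits(32)
print(extra)                    # 3 extra bits when going from 32 to 64 registers
print(round(100 * extra / 32))  # 9, i.e. roughly 10% of a 32-bit instruction
```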


Summary: Classifying Instruction Set Architectures 


Here and at the end of Sections B.3 through B.8 we summarize those characteristics
we would expect to find in a new instruction set architecture, building the
foundation for the MIPS architecture introduced in Section B.9. From this section
we should clearly expect the use of general-purpose registers. Figure B.4,
combined with Appendix A on pipelining, leads to the expectation of a load-store
version of a general-purpose register architecture.
With the class of architecture covered, the next topic is addressing operands. 


Memory Addressing 


Independent of whether the architecture is load-store or allows any operand to be 
a memory reference, it must define how memory addresses are interpreted and 
how they are specified. The measurements presented here are largely, but not 
completely, computer independent. In some cases the measurements are signifi- 
cantly affected by the compiler technology. These measurements have been made 
using an optimizing compiler, since compiler technology plays a critical role. 


Interpreting Memory Addresses 


How is a memory address interpreted? That is, what object is accessed as a 
function of the address and the length? All the instruction sets discussed in this 
book are byte addressed and provide access for bytes (8 bits), half words (16 
bits), and words (32 bits). Most of the computers also provide access for double 
words (64 bits). 

There are two different conventions for ordering the bytes within a larger
object. Little Endian byte order puts the byte whose address is "x . . . x000" at
the least-significant position in the double word (the little end). The bytes are
numbered

    7  6  5  4  3  2  1  0

Big Endian byte order puts the byte whose address is "x . . . x000" at the
most-significant position in the double word (the big end). The bytes are numbered

    0  1  2  3  4  5  6  7
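
As a minimal sketch of the two conventions, the following Python lays out one 64-bit value under each byte order and then shows how an 8-character string looks when loaded as a single double word. The helper names are hypothetical; only the standard `int.to_bytes` and `int.from_bytes` methods are used.

```python
def stored_bytes(value: int, order: str) -> bytes:
    """Bytes of a 64-bit double word, listed from the lowest address upward."""
    return value.to_bytes(8, byteorder=order)


def register_view(memory: bytes, order: str) -> bytes:
    """Load 8 bytes as one double word, then list the register's bytes
    from most significant to least significant."""
    word = int.from_bytes(memory, byteorder=order)
    return word.to_bytes(8, byteorder="big")


value = 0x0102030405060708
print(stored_bytes(value, "little").hex())  # 0807060504030201
print(stored_bytes(value, "big").hex())     # 0102030405060708

text = b"BACKWARD"                          # 8 characters at ascending addresses
print(register_view(text, "big"))           # b'BACKWARD'
print(register_view(text, "little"))        # b'DRAWKCAB'
```

Under Little Endian ordering the first character of the string lands in the least-significant byte, which is why the text reads reversed when the register is written out most-significant byte first.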





When operating within one computer, the byte order is often unnoticeable— 
only programs that access the same locations as both, say, words and bytes can 
notice the difference. Byte order is a problem when exchanging data among com- 
puters with different orderings, however. Little Endian ordering also fails to 
match normal ordering of words when strings are compared. Strings appear 
"SDRAWKCAB" (backwards) in the registers. 

A second memory issue is that in many computers, accesses to objects larger 
than a byte must be aligned. An access to an object of size s bytes at byte address 
A is aligned if A mod s = 0. Figure B.5 shows the addresses at which an access is 
aligned or misaligned. 
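
The alignment rule A mod s = 0 is easy to state in code. The sketch below is illustrative only: the function names are hypothetical, and the access count assumes a memory organized as rows of 8 bytes, each row readable in one reference.

```python
def is_aligned(address: int, size: int) -> bool:
    """An access of `size` bytes at byte address `address` is aligned
    when address mod size == 0."""
    return address % size == 0


def aligned_offsets(size: int, memory_width: int = 8) -> list:
    """Offsets within one memory row at which an object of `size` bytes is aligned."""
    return [offset for offset in range(memory_width) if is_aligned(offset, size)]


def accesses_needed(address: int, size: int, memory_width: int = 8) -> int:
    """Number of memory rows touched by the access (assumed one reference per row)."""
    first_row = address // memory_width
    last_row = (address + size - 1) // memory_width
    return last_row - first_row + 1


for size in (1, 2, 4, 8):
    print(size, aligned_offsets(size))
# 1 [0, 1, 2, 3, 4, 5, 6, 7]
# 2 [0, 2, 4, 6]
# 4 [0, 4]
# 8 [0]

print(accesses_needed(4, 4))  # 1: aligned word
print(accesses_needed(6, 4))  # 2: misaligned word that straddles two rows
```

Note that under this simple model a misaligned object that happens to sit inside one row still needs only one reference; the extra cost appears when the object crosses a row boundary.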

Why would someone design a computer with alignment restrictions? Misalignment
causes hardware complications, since the memory is typically aligned
on a multiple of a word or double-word boundary. A misaligned memory access
may, therefore, take multiple aligned memory references. Thus, even in computers
that allow misaligned access, programs with aligned accesses run faster.

                          Value of 3 low-order bits of byte address
Width of object           0    1    2    3    4    5    6    7
1 byte (byte)             A    A    A    A    A    A    A    A
2 bytes (half word)       A    M    A    M    A    M    A    M
4 bytes (word)            A    M    M    M    A    M    M    M
8 bytes (double word)     A    M    M    M    M    M    M    M

A = aligned, M = misaligned

Figure B.5 Aligned and misaligned addresses of byte, half-word, word, and double-word objects for
byte-addressed computers. For each misaligned example some objects require two memory accesses to
complete. Every aligned object can always complete in one memory access, as long as the memory is as
wide as the object. The figure shows the memory organized as 8 bytes wide. The byte offsets that label
the columns specify the low-order 3 bits of the address.

Even if data are aligned, supporting byte, half-word, and word accesses 
requires an alignment network to align bytes, half words, and words in 64-bit reg- 
isters. For example, in Figure B.5, suppose we read a byte from an address with 
its 3 low-order bits having the value 4. We will need to shift right 3 bytes to align 
the byte to the proper place in a 64-bit register. Depending on the instruction, the 
computer may also need to sign-extend the quantity. Stores are easy: Only the 
addressed bytes in memory may be altered. On some computers a byte, half- 
word, and word operation does not affect the upper portion of a register. 
Although all the computers discussed in this book permit byte, half-word, and 
word accesses to memory, only the IBM 360/370, Intel 80x86, and VAX support 
ALU operations on register operands narrower than the full width. 
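
The shift-and-extend step performed by the alignment network can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and the sketch assumes Big Endian byte numbering (offset 7 is the least-significant byte), which is the numbering under which an offset of 4 corresponds to a right shift of 3 bytes.

```python
def load_byte(double_word: int, offset: int, signed: bool = False) -> int:
    """Extract the byte at `offset` (0..7) from a 64-bit memory row.

    Assumes Big Endian numbering, so the byte at offset 7 is already in the
    least-significant position and offset k needs a right shift of 7 - k bytes.
    """
    if not 0 <= offset <= 7:
        raise ValueError("offset must be in 0..7")
    shift_bytes = 7 - offset
    byte = (double_word >> (8 * shift_bytes)) & 0xFF
    if signed and byte >= 0x80:
        byte -= 0x100  # sign-extend: reinterpret the 8-bit pattern as negative
    return byte


memory_row = 0x00112233F4556677               # bytes at offsets 0..7: 00 11 22 33 F4 55 66 77
print(hex(load_byte(memory_row, 4)))          # 0xf4 (shifted right by 3 bytes)
print(load_byte(memory_row, 4, signed=True))  # -12 after sign extension
```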

Now that we have discussed alternative interpretations of memory addresses,
we can discuss the ways addresses are specified by instructions, called addressing
modes.

Given an address, we now know what bytes to access in memory. In this subsection
we will look at addressing modes: how architectures specify the address
of an object they will access. Addressing modes specify constants and registers in
addition to locations in memory. When a memory location is used, the actual
memory address specified by the addressing mode is called the effective address.

Figure B.6 shows all the data addressing modes that have been used in recent
computers. Immediates or literals are usually considered memory addressing
modes (even though the value they access is in the instruction stream), although
registers are often separated since they don't normally have memory addresses.
We have kept addressing modes that depend on the program counter, called
PC-relative addressing, separate. PC-relative addressing is used primarily for
specifying code addresses in control transfer instructions, discussed in Section B.6.

Addressing mode: Register
  Example instruction: Add R4,R3
  Meaning:             Regs[R4] ← Regs[R4] + Regs[R3]
  When used:           When a value is in a register.

Addressing mode: Immediate
  Example instruction: Add R4,#3
  Meaning:             Regs[R4] ← Regs[R4] + 3
  When used:           For constants.

Addressing mode: Displacement
  Example instruction: Add R4,100(R1)
  Meaning:             Regs[R4] ← Regs[R4] + Mem[100+Regs[R1]]
  When used:           Accessing local variables (+ simulates register indirect,
                       direct addressing modes).

Addressing mode: Register indirect
  Example instruction: Add R4,(R1)
  Meaning:             Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
  When used:           Accessing using a pointer or a computed address.

Addressing mode: Indexed
  Example instruction: Add R3,(R1+R2)
  Meaning:             Regs[R3] ← Regs[R3] + Mem[Regs[R1]+Regs[R2]]
  When used:           Sometimes useful in array addressing: R1 = base of array;
                       R2 = index amount.

Addressing mode: Direct or absolute
  Example instruction: Add R1,(1001)
  Meaning:             Regs[R1] ← Regs[R1] + Mem[1001]
  When used:           Sometimes useful for accessing static data; address constant
                       may need to be large.

Addressing mode: Memory indirect
  Example instruction: Add R1,@(R3)
  Meaning:             Regs[R1] ← Regs[R1] + Mem[Mem[Regs[R3]]]
  When used:           If R3 is the address of a pointer p, then mode yields *p.

Addressing mode: Autoincrement
  Example instruction: Add R1,(R2)+
  Meaning:             Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
                       Regs[R2] ← Regs[R2] + d
  When used:           Useful for stepping through arrays within a loop. R2 points to
                       start of array; each reference increments R2 by size of an
                       element, d.

Addressing mode: Autodecrement
  Example instruction: Add R1,-(R2)
  Meaning:             Regs[R2] ← Regs[R2] - d
                       Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
  When used:           Same use as autoincrement. Autodecrement/-increment can also
                       act as push/pop to implement a stack.

Addressing mode: Scaled
  Example instruction: Add R1,100(R2)[R3]
  Meaning:             Regs[R1] ← Regs[R1] + Mem[100+Regs[R2]+Regs[R3]*d]
  When used:           Used to index arrays. May be applied to any indexed addressing
                       mode in some computers.

Figure B.6 Selection of addressing modes with examples, meaning, and usage. In autoincrement/-decrement
and scaled addressing modes, the variable d designates the size of the data item being accessed (i.e.,
whether the instruction is accessing 1, 2, 4, or 8 bytes). These addressing modes are only useful when
the elements being accessed are adjacent in memory. RISC computers use displacement addressing to
simulate register indirect with 0 for the address and to simulate direct addressing using 0 in the
base register. In our measurements, we use the first name shown for each mode. The extensions to C
used as hardware descriptions are defined on page B-36.

Figure B.6 shows the most common names for the addressing modes, though
the names differ among architectures. In this figure and throughout the book, we
will use an extension of the C programming language as a hardware description
notation. In this figure, only one non-C feature is used: The left arrow (←) is used
for assignment. We also use the array Mem as the name for main memory and the
array Regs for registers. Thus, Mem[Regs[R1]] refers to the contents of the memory
location whose address is given by the contents of register 1 (R1). Later, we will
introduce extensions for accessing and transferring data smaller than a word.
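
To make the Mem/Regs notation concrete, here is a small sketch that computes the effective address for several of the modes listed in Figure B.6. It is not a model of any real instruction set: the function, its parameter names, and the dictionary-based register file and memory are all assumptions made for illustration. The autoincrement and autodecrement modes are omitted because they also update a register.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0, d=1):
    """Return the effective address for a memory addressing mode.

    regs maps register names to values, mem maps addresses to stored words,
    and d is the size in bytes of the data item (used by scaled mode only).
    """
    if mode == "displacement":
        return disp + regs[base]
    if mode == "register indirect":
        return regs[base]
    if mode == "indexed":
        return regs[base] + regs[index]
    if mode == "direct":
        return disp
    if mode == "memory indirect":
        return mem[regs[base]]
    if mode == "scaled":
        return disp + regs[base] + regs[index] * d
    raise ValueError(f"unsupported mode: {mode}")


regs = {"R0": 0, "R1": 200, "R2": 8, "R3": 3}
mem = {200: 4000}

print(effective_address("displacement", regs, mem, base="R1", disp=100))             # 300
print(effective_address("indexed", regs, mem, base="R1", index="R2"))                # 208
print(effective_address("memory indirect", regs, mem, base="R1"))                    # 4000
print(effective_address("scaled", regs, mem, base="R2", index="R3", disp=100, d=4))  # 120
```

The same function shows why displacement mode can stand in for two others: a displacement of 0 gives the register indirect address, and a base register holding 0 gives the direct address.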

Addressing modes have the ability to significantly reduce instruction counts; 
they also add to the complexity of building a computer and may increase the 
average CPI (clock cycles per instruction) of computers that implement those 
modes. Thus, the usage of various addressing modes is quite important in helping 
the architect choose what to include. 

Figure B.7 shows the results of measuring addressing mode usage patterns in 
three programs on the VAX architecture. We use the old VAX architecture for a 
few measurements in this appendix because it has the richest set of addressing 
modes and the fewest restrictions on memory addressing. For example, Figure 
B.6 on page B-9 shows all the modes the VAX supports. Most measurements in 
this appendix, however, will use the more recent register-register architectures to 
show how programs use instruction sets of current computers. 

As Figure B.7 shows, displacement and immediate addressing dominate 
addressing mode usage. Let's look at some properties of these two heavily used 
modes. 


Displacement Addressing Mode 


The major question that arises for a displacement-style addressing mode is that of 
the range of displacements used. Based on the use of various displacement sizes, 
a decision of what sizes to support can be made. Choosing the displacement field 
sizes is important because they directly affect the instruction length. Figure B.8 
shows the measurements taken on the data access on a load-store architecture 
using our benchmark programs. We look at branch offsets in Section B.6—data 
accessing patterns and branches are different; little is gained by combining them, 
although in practice the immediate sizes are made the same for simplicity. 


Immediate or Literal Addressing Mode 


Immediates can be used in arithmetic operations, in comparisons (primarily for
branches), and in moves where a constant is wanted in a register. The last case
occurs for constants written in the code, which tend to be small, and for
address constants, which tend to be large. For the use of immediates it is important
to know whether they need to be supported for all operations or for only a
subset. Figure B.9 shows the frequency of immediates for the general classes of
integer and floating-point operations in an instruction set.

Figure B.7 Summary of use of memory addressing modes (including immediates).
These major addressing modes account for all but a few percent (0% to 3%) of the
memory accesses. Register modes, which are not counted, account for one-half of the
operand references, while memory addressing modes (including immediate) account
for the other half. Of course, the compiler affects what addressing modes are used; see
Section B.8. The memory indirect mode on the VAX can use displacement, autoincrement,
or autodecrement to form the initial memory address; in these programs, almost
all the memory indirect references use displacement mode as the base. Displacement
mode includes all displacement lengths (8, 16, and 32 bits). The PC-relative addressing
modes, used almost exclusively for branches, are not included. Only the addressing
modes with an average frequency of over 1% are shown.

Another important instruction set measurement is the range of values for 
immediates. Like displacement values, the size of immediate values affects 
instruction length. As Figure B.10 shows, small immediate values are most 
heavily used. Large immediates are sometimes used, however, most likely in 
addressing calculations. 
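
Figures B.8 and B.10 plot values by the number of bits needed to represent their magnitude. A minimal sketch of that measure, plus a check of whether a value fits a signed field of a given width, might look like this (the function names are hypothetical, and the field is assumed to be two's complement):

```python
def magnitude_bits(value: int) -> int:
    """Bits needed for the magnitude of value, excluding the sign bit.

    A value of 0 needs 0 bits, matching the zero point on the x-axis.
    """
    return abs(value).bit_length()


def fits_signed(value: int, field_bits: int) -> bool:
    """True if value fits a two's complement field of field_bits bits."""
    limit = 1 << (field_bits - 1)
    return -limit <= value < limit


print(magnitude_bits(0))        # 0
print(magnitude_bits(255))      # 8
print(fits_signed(32767, 16))   # True
print(fits_signed(32768, 16))   # False: needs a 17-bit signed field
```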


Summary: Memory Addressing 


First, because of their popularity, we would expect a new architecture to support
at least the following addressing modes: displacement, immediate, and register
indirect. Figure B.7 shows that they represent 75% to 99% of the addressing
modes used in our measurements. Second, we would expect the size of the
address for displacement mode to be at least 12-16 bits, since the caption in Figure
B.8 suggests these sizes would capture 75% to 99% of the displacements.
Third, we would expect the size of the immediate field to be at least 8-16 bits.
This claim is not substantiated by the caption of the figure to which it refers.

Figure B.8 Displacement values are widely distributed. There are both a large number of small values
and a fair number of large values. The wide distribution of displacement values is due to multiple
storage areas for variables and different displacements to access them (see Section B.8) as well as the
overall addressing scheme the compiler uses. The x-axis is log2 of the displacement; that is, the size
of a field needed to represent the magnitude of the displacement. Zero on the x-axis shows the
percentage of displacements of value 0. The graph does not include the sign bit, which is heavily
affected by the storage layout. Most displacements are positive, but a majority of the largest
displacements (14+ bits) are negative. Since these data were collected on a computer with 16-bit
displacements, they cannot tell us about longer displacements. These data were taken on the Alpha
architecture with full optimization (see Section B.8) for SPEC CPU2000, showing the average of integer
programs (CINT2000) and the average of floating-point programs (CFP2000).

Figure B.9 About one-quarter of data transfers and ALU operations have an immediate
operand. The bottom bars show that integer programs use immediates in about
one-fifth of the instructions, while floating-point programs use immediates in about
one-sixth of the instructions. For loads, the load immediate instruction loads 16 bits
into either half of a 32-bit register. Load immediates are not loads in a strict sense
because they do not access memory. Occasionally a pair of load immediates is used to
load a 32-bit constant, but this is rare. (For ALU operations, shifts by a constant amount
are included as operations with immediate operands.) The programs and computer
used to collect these statistics are the same as in Figure B.8.

Figure B.10 The distribution of immediate values. The x-axis shows the number of bits needed to
represent the magnitude of an immediate value; 0 means the immediate field value was 0. The majority
of the immediate values are positive. About 20% were negative for CINT2000, and about 30% were
negative for CFP2000. These measurements were taken on an Alpha, where the maximum immediate is
16 bits, for the same programs as in Figure B.8. A similar measurement on the VAX, which supported
32-bit immediates, showed that about 20% to 25% of immediates were longer than 16 bits. Thus, 16 bits
would capture about 80% and 8 bits about 50%.

Having covered instruction set classes and decided on register-register archi- 
tectures, plus the previous recommendations on data addressing modes, we next 
cover the sizes and meanings of data. 


Type and Size of Operands 


How is the type of an operand designated? Normally, encoding in the opcode 
designates the type of an operand—this is the method used most often. Alterna- 
tively, the data can be annotated with tags that are interpreted by the hardware. 
These tags specify the type of the operand, and the operation is chosen accord- 
ingly. Computers with tagged data, however, can only be found in computer 
museums. 

Let's start with desktop and server architectures. Usually the type of an operand
(integer, single-precision floating point, character, and so on) effectively
gives its size. Common operand types include character (8 bits), half word (16
bits), word (32 bits), single-precision floating point (also 1 word), and
double-precision floating point (2 words). Integers are almost universally represented as
two's complement binary numbers. Characters are usually in ASCII, but the 16-bit
Unicode (used in Java) is gaining popularity with the internationalization of
computers. Until the early 1980s, most computer manufacturers chose their own
floating-point representation. Almost all computers since that time follow the
same standard for floating point, the IEEE standard 754. The IEEE floating-point
standard is discussed in detail in Appendix I.

Some architectures provide operations on character strings, although such 
operations are usually quite limited and treat each byte in the string as a single 
character. Typical operations supported on character strings are comparisons 
and moves. 

For business applications, some architectures support a decimal format, 
usually called packed decimal or binary-coded decimal—4 bits are used to 
encode the values 0-9, and 2 decimal digits are packed into each byte. Numeric 
character strings are sometimes called unpacked decimal, and operations— 
called packing and unpacking—are usually provided for converting back and 
forth between them. 
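
Packing and unpacking can be sketched in a few lines. This is a simplified illustration with hypothetical function names: it packs two decimal digits per byte as described above, pads odd-length inputs with a leading zero, and ignores the sign handling that real packed decimal formats typically include.

```python
def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits into bytes, two 4-bit digits per byte."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def unpack_bcd(packed: bytes) -> str:
    """Unpack bytes produced by pack_bcd back into a digit string."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in packed)


print(pack_bcd("1234").hex())          # 1234: the hex dump reads like the decimal digits
print(unpack_bcd(pack_bcd("1234")))    # 1234
print(unpack_bcd(pack_bcd("123")))     # 0123 (leading zero from padding)
```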

One reason to use decimal operands is to get results that exactly match decimal
numbers, as some decimal fractions do not have an exact representation in
binary. For example, 0.1 (base 10) is a simple fraction in decimal, but in binary it
requires an infinite set of repeating digits: 0.0001100110011... (base 2). Thus,
calculations that are exact in decimal can be close but inexact in binary, which can be a
problem for financial transactions. (See Appendix I to learn more about precise
arithmetic.)
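
The repeating binary expansion, and its practical consequence, can be reproduced with a short sketch. The helper name is hypothetical; the comparison uses Python's binary floating point and the standard `decimal` module as a stand-in for decimal arithmetic.

```python
from decimal import Decimal


def binary_fraction_digits(numerator: int, denominator: int, count: int) -> str:
    """First `count` binary digits after the point of numerator/denominator
    (for a fraction between 0 and 1), by repeated doubling."""
    digits = []
    remainder = numerator
    for _ in range(count):
        remainder *= 2
        digits.append(str(remainder // denominator))
        remainder %= denominator
    return "".join(digits)


print("0." + binary_fraction_digits(1, 10, 13))            # 0.0001100110011
print(0.1 + 0.2 == 0.3)                                    # False in binary floating point
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True in decimal arithmetic
```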

Our SPEC benchmarks use byte or character, half-word (short integer), word 
(integer), double-word (long integer), and floating-point data types. Figure B.11 
shows the dynamic distribution of the sizes of objects referenced from memory 
for these programs. The frequency of access to different data types helps in 
deciding what types are most important to support efficiently. Should the com- 
puter have a 64-bit access path, or would taking two cycles to access a double 
word be satisfactory? As we saw earlier, byte accesses require an alignment net- 
work: How important is it to support bytes as primitives? Figure B.11 uses mem- 
ory references to examine the types of data being accessed. 

In some architectures, objects in registers may be accessed as bytes or half 
words. However, such access is very infrequent—on the VAX, it accounts for no 
more than 12% of register references, or roughly 6% of all operand accesses in 
these programs. 


Operations in the Instruction Set 


The operators supported by most instruction set architectures can be categorized
as in Figure B.12. One rule of thumb across all architectures is that the most
widely executed instructions are the simple operations of an instruction set. For
example, Figure B.13 shows 10 simple instructions that account for 96% of
instructions executed for a collection of integer programs running on the popular
Intel 80x86. Hence, the implementor of these instructions should be sure to make
these fast, as they are the common case.

Figure B.11 Distribution of data accesses by size for the benchmark programs. The
double-word data type is used for double-precision floating point in floating-point
programs and for addresses, since the computer uses 64-bit addresses. On a 32-bit address
computer the 64-bit addresses would be replaced by 32-bit addresses, and so almost all
double-word accesses in integer programs would become single-word accesses.

Operator type            Examples
Arithmetic and logical   Integer arithmetic and logical operations: add, subtract, and, or,
                         multiply, divide
Data transfer            Loads-stores (move instructions on computers with memory
                         addressing)
Control                  Branch, jump, procedure call and return, traps
System                   Operating system call, virtual memory management instructions
Floating point           Floating-point operations: add, multiply, divide, compare
Decimal                  Decimal add, decimal multiply, decimal-to-character conversions
String                   String move, string compare, string search
Graphics                 Pixel and vertex operations, compression/decompression
                         operations

Figure B.12 Categories of instruction operators and examples of each. All computers
generally provide a full set of operations for the first three categories. The support
for system functions in the instruction set varies widely among architectures, but all
computers must have some instruction support for basic system functions. The amount
of support in the instruction set for the last four categories may vary from none to an
extensive set of special instructions. Floating-point instructions will be provided in any
computer that is intended for use in an application that makes much use of floating
point. These instructions are sometimes part of an optional instruction set. Decimal and
string instructions are sometimes primitives, as in the VAX or the IBM 360, or may be
synthesized by the compiler from simpler instructions. Graphics instructions typically
operate on many smaller data items in parallel, for example, performing eight 8-bit
additions on two 64-bit operands.

Rank    80x86 instruction          % total executed
1       load                       22%
2       conditional branch         20%
3       compare                    16%
4       store                      12%
5       add                        8%
6       and                        6%
7       sub                        5%
8       move register-register     4%
9       call                       1%
10      return                     1%
        Total                      96%

Figure B.13 The top 10 instructions for the 80x86. Simple instructions dominate this
list and are responsible for 96% of the instructions executed. These percentages are the
average of the five SPECint92 programs.

As mentioned before, the instructions in Figure B.13 are found in every computer
for every application (desktop, server, embedded), with the variations of
operations in Figure B.12 largely depending on which data types the instruction
set includes.
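
The caption of Figure B.12 mentions graphics instructions that perform eight 8-bit additions on two 64-bit operands. A rough software model of such an operation is sketched below; the function name is hypothetical, and wraparound (rather than saturating) arithmetic in each lane is an assumption.

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add two 64-bit operands as eight independent 8-bit lanes.

    Each lane wraps around modulo 256, and no carry crosses into the
    neighboring lane, unlike an ordinary 64-bit addition.
    """
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift
    return result


a = 0x01020304050607FF
b = 0x0101010101010101
print(hex(packed_add_u8(a, b)))          # 0x203040506070800: low lane wraps to 00
print(hex((a + b) & 0xFFFFFFFFFFFFFFFF)) # 0x203040506070900: carry leaks into next byte
```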


Instructions for Control Flow 


Because the measurements of branch and jump behavior are fairly independent of 
other measurements and applications, we now examine the use of control flow 
instructions, which have little in common with the operations of the previous 
sections. 

There is no consistent terminology for instructions that change the flow of 
control. In the 1950s they were typically called transfers. Beginning in 1960 the 
name branch began to be used. Later, computers introduced additional names. 
Throughout this book we will use jump when the change in control is uncondi- 
tional and branch when the change is conditional. 

We can distinguish four different types of control flow change:

• Conditional branches

• Jumps

• Procedure calls

• Procedure returns

We want to know the relative frequency of these events, as each event is different,
may use different instructions, and may have different behavior. Figure B.14
shows the frequencies of these control flow instructions for a load-store computer
running our benchmarks.

Figure B.14 Breakdown of control flow instructions into three classes: calls or
returns, jumps, and conditional branches. Conditional branches clearly dominate.
Each type is counted in one of three bars. The programs and computer used to collect
these statistics are the same as those in Figure B.8.


Addressing Modes for Control Flow Instructions 


The destination address of a control flow instruction must always be specified. 
This destination is specified explicitly in the instruction in the vast majority of 
cases—procedure return being the major exception, since for return the target 
is not known at compile time. The most common way to specify the destination 
is to supply a displacement that is added to the program counter (PC). Control 
flow instructions of this sort are called PC-relative. PC-relative branches or 
jumps are advantageous because the target is often near the current instruction, 
and specifying the position relative to the current PC requires fewer bits. Using 
PC-relative addressing also permits the code to run independently of where it is 
loaded. This property, called position independence, can eliminate some work 
when the program is linked and is also useful in programs linked dynamically 
during execution. 
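
Position independence follows directly from the arithmetic of PC-relative addressing, as the sketch below illustrates. It is a simplification: the function names are hypothetical, and real architectures differ in details such as whether the displacement is relative to the branch itself or to the following instruction and whether it is scaled by the instruction size.

```python
def branch_target(pc: int, displacement: int) -> int:
    """Target of a PC-relative branch: the displacement is added to the PC."""
    return pc + displacement


def target_offset_in_program(load_address: int, branch_offset: int, displacement: int) -> int:
    """Where the branch lands, measured from the start of the loaded program."""
    pc = load_address + branch_offset
    return branch_target(pc, displacement) - load_address


# The same encoded displacement reaches the same place in the program
# no matter where the program is loaded.
print(target_offset_in_program(0x1000, 0x40, -0x10))    # 48
print(target_offset_in_program(0x80000, 0x40, -0x10))   # 48
```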

To implement returns and indirect jumps when the target is not known at 
compile time, a method other than PC-relative addressing is required. Here, there 
must be a way to specify the target dynamically, so that it can change at run time. 
This dynamic address may be as simple as naming a register that contains the tar- 
get address; alternatively, the jump may permit any addressing mode to be used 
to supply the target address. 




These register indirect jumps are also useful for four other important features:

• Case or switch statements, found in most programming languages (which
  select among one of several alternatives)

• Virtual functions or methods in object-oriented languages like C++ or Java
  (which allow different routines to be called depending on the type of the
  argument)

• High-order functions or function pointers in languages like C or C++ (which
  allow functions to be passed as arguments, giving some of the flavor of
  object-oriented programming)

• Dynamically shared libraries (which allow a library to be loaded and linked
  at run time only when it is actually invoked by the program rather than loaded
  and linked statically before the program is run)


In all four cases the target address is not known at compile time, and hence is 
usually loaded from memory into a register before the register indirect jump. 
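
A high-level analogue of this pattern is a jump table: the target is fetched from a table in memory and then called indirectly. The sketch below uses Python function references to stand in for code addresses; the handler names and the table itself are purely illustrative.

```python
def handle_add() -> str:
    return "add"


def handle_sub() -> str:
    return "sub"


def handle_default() -> str:
    return "default"


# The table plays the role of memory holding target addresses.
JUMP_TABLE = [handle_add, handle_sub]


def dispatch(selector: int) -> str:
    """Model of a switch statement compiled to a table lookup plus indirect jump."""
    if 0 <= selector < len(JUMP_TABLE):
        target = JUMP_TABLE[selector]  # load the target from the table
        return target()                # transfer control through the loaded value
    return handle_default()


print(dispatch(0))  # add
print(dispatch(1))  # sub
print(dispatch(7))  # default
```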

As branches generally use PC-relative addressing to specify their targets, an 
important question concerns how far branch targets are from branches. Knowing 
the distribution of these displacements will help in choosing what branch offsets 
to support, and thus will affect the instruction length and encoding. Figure B.15 
shows the distribution of displacements for PC-relative branches in instructions. 
About 75% of the branches are in the forward direction.

Figure B.15 Branch distances in terms of number of instructions between the target and the branch
instruction. The most frequent branches in the integer programs are to targets that can be encoded in
4-8 bits. This result tells us that short displacement fields often suffice for branches and that the
designer can gain some encoding density by having a shorter instruction with a smaller branch
displacement. These measurements were taken on a load-store computer (Alpha architecture) with all
instructions aligned on word boundaries. An architecture that requires fewer instructions for the same
program, such as a VAX, would have shorter branch distances. However, the number of bits needed for
the displacement may increase if the computer has variable-length instructions to be aligned on any
byte boundary. The programs and computer used to collect these statistics are the same as those in
Figure B.8.


Conditional Branch Options


Since most changes in control flow are branches, deciding how to specify the 
branch condition is important. Figure B.16 shows the three primary techniques in 
use today and their advantages and disadvantages. 

One of the most noticeable properties of branches is that a large number of 
the comparisons are simple tests, and a large number are comparisons with zero. 
Thus, some architectures choose to treat these comparisons as special cases, 
especially if a compare and branch instruction is being used. Figure B.17 shows 
the frequency of different comparisons used for conditional branching. 


Procedure Invocation Options 


Procedure calls and returns include control transfer and possibly some state sav- 
ing; at a minimum the return address must be saved somewhere, sometimes in a 
special link register or just a GPR. Some older architectures provide a mecha- 
nism to save many registers, while newer architectures require the compiler to 
generate stores and loads for each register saved and restored. 

There are two basic conventions in use to save registers: either at the call site 
or inside the procedure being called. Caller saving means that the calling proce- 
dure must save the registers that it wants preserved for access after the call, and 
thus the called procedure need not worry about registers. Callee saving is the 
opposite: the called procedure must save the registers it wants to use, leaving the 
caller unrestrained.There are times when caller save must be used because of 














Name Examples How condition is tested Advantages Disadvantages 

Condition 80x86, ARM, Tests special bits set by Sometimes condition CC is extra state. Condition 

code (CC) PowerPC, ALU operations, possibly is set for free. codes constrain the ordering of 

SPARC, SuperH under program control. instructions since they pass 

information from one instruction 
to a branch. 

Condition Alpha, MIPS Tests arbitrary register Simple. Uses up a register, 

register with the result of a 

comparison. 
Compare PA-RISC, VAX Compare is part of the One instruction rather May be too much work per 
and branch branch. Often compare is than two for a branch, instruction for pipelined 


limited to subset. execution. 





Figure B.16 The major methods for evaluating branch conditions, their advantages, and their disadvantages. 
Although condition codes can be set by ALU operations that are needed for other purposes, measurements on pro- 
grams show that this rarely happens. The major implementation problems with condition codes arise when the con- 
dition code is set by a large or haphazardly chosen subset of the instructions, rather than being controlled by a bit in 
the instruction. Computers with compare and branch often limit the set of compares and use a condition register for 
more complex compares. Often, different techniques are used for branches based on floating-point comparison versus
those based on integer comparison. This dichotomy is reasonable since the number of branches that depend on
floating-point comparisons is much smaller than the number depending on integer comparisons. 
Figure B.17 Frequency of different types of compares in conditional branches. Less
than (or equal) branches dominate this combination of compiler and architecture.
These measurements include both the integer and floating-point compares in
branches. The programs and computer used to collect these statistics are the same as
those in Figure B.8.


access patterns to globally visible variables in two different procedures. For 
example, suppose we have a procedure P1 that calls procedure P2, and both pro-
cedures manipulate the global variable x. If P1 had allocated x to a register, it
must be sure to save x to a location known by P2 before the call to P2. A com-
piler's ability to discover when a called procedure may access register-allocated
quantities is complicated by the possibility of separate compilation. Suppose P2 
may not touch x but can call another procedure, P3, that may access x, yet P2 and 
P3 are compiled separately. Because of these complications, most compilers will 
conservatively caller save any variable that may be accessed during a call. 

In the cases where either convention could be used, some programs will be
better served by callee save and some will be better served by caller save. As
a result, most real systems today use a combination of the two mechanisms. This 
convention is specified in an application binary interface (ABI) that sets down the 
basic rules as to which registers should be caller saved and which should be 
callee saved. Later in this appendix we will examine the mismatch between 
sophisticated instructions for automatically saving registers and the needs of the 
compiler. 
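
To make the trade-off between the two conventions concrete, the following is a minimal sketch that counts save/restore pairs for a single call. The function name, the register names, and the cost model (one pair per saved register, nothing else counted) are assumptions made for illustration; they are not taken from any particular ABI.

```python
def save_restore_pairs(live_across_call, used_by_callee):
    """Count register save/restore pairs for one call under each convention.

    live_across_call: registers the caller still needs after the call returns.
    used_by_callee:   registers the called procedure overwrites.
    """
    live = set(live_across_call)
    used = set(used_by_callee)
    return {
        # Caller saving: the caller protects everything it still needs,
        # whether or not the callee would have touched it.
        "caller_save": len(live),
        # Callee saving: the callee protects everything it overwrites,
        # whether or not the caller still needs it.
        "callee_save": len(used),
        # Lower bound: registers that are both needed and overwritten.
        "strictly_necessary": len(live & used),
    }


# A caller with many live values calling a small procedure favors callee save.
busy_caller = save_restore_pairs({"r1", "r2", "r3", "r4"}, {"r1"})
# A caller with few live values calling a large procedure favors caller save.
busy_callee = save_restore_pairs({"r1"}, {"r1", "r2", "r3", "r4"})
print(busy_caller)
print(busy_callee)
```

Neither convention reaches the lower bound in both cases, which is one way to see why an ABI typically splits the register file between the two.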


Summary: Instructions for Control Flow 


Control flow instructions are some of the most frequently executed instructions. 
Although there are many options for conditional branches, we would expect 
branch addressing in a new architecture to be able to jump to hundreds of instruc-
tions either above or below the branch. This requirement suggests a PC-relative
branch displacement of at least 8 bits. We would also expect to see register indi- 
rect and PC-relative addressing for jump instructions to support returns as well as 
many other features of current systems. 
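
The link between branch reach and displacement width can be checked with a short calculation. This sketch assumes a two's-complement displacement counted in instructions rather than bytes; the function names are ours.

```python
def signed_displacement_bits(reach):
    """Smallest two's-complement width whose range covers -reach..+reach."""
    bits = 1
    # The positive side is the tighter limit: an n-bit signed field
    # reaches from -2**(n-1) up to 2**(n-1) - 1.
    while (1 << (bits - 1)) - 1 < reach:
        bits += 1
    return bits


def reach_of(bits):
    """Range of offsets representable in a signed field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


print(signed_displacement_bits(100))  # width needed to reach 100 instructions either way
print(reach_of(8))                    # offsets covered by an 8-bit displacement
```

Under these assumptions an 8-bit field covers a little over a hundred instructions in each direction, so reaching several hundred instructions takes a few more bits.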

We have now completed our instruction architecture tour at the level seen by 
an assembly language programmer or compiler writer. We are leaning toward a 
load-store architecture with displacement, immediate, and register indirect 
addressing modes. These data are 8-, 16-, 32-, and 64-bit integers and 32- and 64- 
bit floating-point data. The instructions include simple operations, PC-relative 
conditional branches, jump and link instructions for procedure call, and register 
indirect jumps for procedure return (plus a few other uses). 

Now we need to select how to represent this architecture in a form that makes 
it easy for the hardware to execute. 


Encoding an Instruction Set 


Clearly, the choices mentioned above will affect how the instructions are encoded 
into a binary representation for execution by the processor. This representation 
affects not only the size of the compiled program; it affects the implementation of 
the processor, which must decode this representation to quickly find the operation 
and its operands. The operation is typically specified in one field, called the 
opcode. As we shall see, the important decision is how to encode the addressing 
modes with the operations. 

This decision depends on the range of addressing modes and the degree of 
independence between opcodes and modes. Some older computers have one to 
five operands with 10 addressing modes for each operand (see Figure B.6). For 
such a large number of combinations, typically a separate address specifier is 
needed for each operand: The address specifier tells what addressing mode is 
used to access the operand. At the other extreme are load-store computers with 
only one memory operand and only one or two addressing modes; obviously, in 
this case, the addressing mode can be encoded as part of the opcode. 

When encoding the instructions, the number of registers and the number of 
addressing modes both have a significant impact on the size of instructions, as the 
register field and addressing mode field may appear many times in a single 
instruction. In fact, for most instructions many more bits are consumed in encod- 
ing addressing modes and register fields than in specifying the opcode. The archi- 
tect must balance several competing forces when encoding the instruction set: 


1. The desire to have as many registers and addressing modes as possible. 


2. The impact of the size of the register and addressing mode fields on the aver- 
age instruction size and hence on the average program size. 


3. A desire to have instructions encoded into lengths that will be easy to handle 
in a pipelined implementation. (The value of easily decoded instructions is 
discussed in Appendix A and Chapter 2.) As a minimum, the architect wants
instructions to be in multiples of bytes, rather than an arbitrary bit length. 
Many desktop and server architects have chosen to use a fixed-length instruc- 
tion to gain implementation benefits while sacrificing average code size. 


Figure B.18 shows three popular choices for encoding the instruction set. The 
first we call variable, since it allows virtually all addressing modes to be used with all
operations. This style is best when there are many addressing modes and opera- 
tions. The second choice we call fixed, since it combines the operation and the 
addressing mode into the opcode. Often fixed encoding will have only a single 
size for all instructions; it works best when there are few addressing modes and 
operations. The trade-off between variable encoding and fixed encoding is size of 
programs versus ease of decoding in the processor. Variable tries to use as few 
bits as possible to represent the program, but individual instructions can vary 
widely in both size and the amount of work to be performed. 

Let's look at an 80x86 instruction to see an example of the variable encoding: 


add EAX,1000(EBX)























(a) Variable (e.g., Intel 80x86, VAX)

    Operation and no. of operands | Address specifier 1 | Address field 1 | ... | Address specifier n | Address field n

(b) Fixed (e.g., Alpha, ARM, MIPS, PowerPC, SPARC, SuperH)

    Operation | Address field 1 | Address field 2 | Address field 3

(c) Hybrid (e.g., IBM 360/370, MIPS16, Thumb, TI TMS320C54x)

    Operation | Address specifier | Address field
    Operation | Address specifier 1 | Address specifier 2 | Address field
    Operation | Address specifier | Address field 1 | Address field 2





Figure B.18 Three basic variations in instruction encoding: variable length, fixed 
length, and hybrid. The variable format can support any number of operands, with 
each address specifier determining the addressing mode and the length of the specifier 
for that operand. It generally enables the smallest code representation, since unused 
fields need not be included. The fixed format always has the same number of operands, 
with the addressing modes (if options exist) specified as part of the opcode. It generally 
results in the largest code size. Although the fields tend not to vary in their location, 
they will be used for different purposes by different instructions. The hybrid approach
has multiple formats specified by the opcode, adding one or two fields to specify the 
addressing mode and one or two fields to specify the operand address. 
The name add means a 32-bit integer add instruction with two operands, and this
opcode takes 1 byte. An 80x86 address specifier is 1 or 2 bytes, specifying the 
source/destination register (EAX) and the addressing mode (displacement in this 
case) and base register (EBX) for the second operand. This combination takes 1 
byte to specify the operands. When in 32-bit mode (see Appendix J), the size of 
the address field is either 1 byte or 4 bytes. Since 1000 is bigger than 2^8, the total
length of the instruction is


1 + 1 + 4 = 6 bytes


The length of 80x86 instructions varies between 1 and 17 bytes. 80x86 programs
are generally smaller than programs for the RISC architectures, which use fixed formats (see
Appendix J).
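
The byte count for the add example can be reproduced with a deliberately simplified model of variable-length encoding. The sketch below assumes a 1-byte opcode, a 1-byte address specifier, and a displacement stored in 1 byte when it fits a signed 8-bit value and in 4 bytes otherwise; prefixes, scaled-index bytes, and the zero-displacement case are ignored, and the function name is ours.

```python
def variable_length_bytes(displacement, opcode_bytes=1, specifier_bytes=1):
    """Length of a register + displacement(base) instruction in a simplified model."""
    # The address field shrinks to 1 byte only when the displacement is small.
    address_field_bytes = 1 if -128 <= displacement <= 127 else 4
    return opcode_bytes + specifier_bytes + address_field_bytes


print(variable_length_bytes(1000))  # the displacement used in the add example
print(variable_length_bytes(100))   # a displacement that fits the short form
```

The same operation costs half as many bytes when the displacement is small, which is the kind of saving a fixed 32-bit format cannot offer.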

Given these two poles of instruction set design of variable and fixed, the third 
alternative immediately springs to mind: Reduce the variability in size and work 
of the variable architecture but provide multiple instruction lengths to reduce 
code size. This hybrid approach is the third encoding alternative, and we'll see 
examples shortly. 


Reduced Code Size in RISCs 


As RISC computers started being used in embedded applications, the 32-bit fixed 
format became a liability since cost and hence smaller code are important. In 
response, several manufacturers offered a new hybrid version of their RISC 
instruction sets, with both 16-bit and 32-bit instructions. The narrow instructions 
support fewer operations, smaller address and immediate fields, fewer registers, 
and two-address format rather than the classic three-address format of RISC 
computers. Appendix J gives two examples, the ARM Thumb and MIPS 
MIPS16, which both claim a code size reduction of up to 40%.

In contrast to these instruction set extensions, IBM simply compresses its 
standard instruction set, and then adds hardware to decompress instructions as 
they are fetched from memory on an instruction cache miss. Thus, the instruction 
cache contains full 32-bit instructions, but compressed code is kept in main mem-
ory, ROMs, and the disk. The advantage of MIPS16 and Thumb is that instruction
caches act as if they are about 25% larger, while IBM's CodePack means that 
compilers need not be changed to handle different instruction sets and instruction 
decoding can remain simple. 

CodePack starts with run-length encoding compression on any PowerPC pro- 
gram, and then loads the resulting compression tables in a 2 KB table on chip. 
Hence, every program has its own unique encoding. To handle branches, which 
are no longer to an aligned word boundary, the PowerPC creates a hash table in 
memory that maps between compressed and uncompressed addresses. Like a 
TLB (see Chapter 5), it caches the most recently used address maps to reduce the 
number of memory accesses. IBM claims an overall performance cost of 10%
in exchange for a code size reduction of 35% to 40%.

Hitachi simply invented a RISC instruction set with a fixed 16-bit format, 
called SuperH, for embedded applications (see Appendix J). It has 16 rather than 
32 registers to make it fit the narrower format and fewer instructions, but other-
wise looks like a classic RISC architecture.


Summary: Encoding an Instruction Set 


Decisions made in the components of instruction set design discussed in previous 
sections determine whether the architect has the choice between variable and fixed 
instruction encodings. Given the choice, the architect more interested in code size 
than performance will pick variable encoding, and the one more interested in per- 
formance than code size will pick fixed encoding. Appendix D gives 13 examples 
of the results of architects' choices. In Appendix A and Chapter 2, the impact of 
variability on performance of the processor will be discussed further. 

We have almost finished laying the groundwork for the MIPS instruction set 
architecture that will be introduced in Section B.9. Before we do that, however, it 
will be helpful to take a brief look at compiler technology and its effect on pro- 
gram properties. 


Crosscutting Issues: The Role of Compilers 


Today almost all programming is done in high-level languages for desktop and 
server applications. This development means that since most instructions exe- 
cuted are the output of a compiler, an instruction set architecture is essentially a 
compiler target. In earlier times for these applications, architectural decisions 
were often made to ease assembly language programming or for a specific kernel. 
Because the compiler will significantly affect the performance of a computer, 
understanding compiler technology today is critical to designing and efficiently 
implementing an instruction set. 

Once it was popular to try to isolate the compiler technology and its effect on 
hardware performance from the architecture and its performance, just as it was 
popular to try to separate architecture from its implementation. This separation is 
essentially impossible with today's desktop compilers and computers. Architec- 
tural choices affect the quality of the code that can be generated for a computer 
and the complexity of building a good compiler for it, for better or for worse. 

In this section, we discuss the critical goals in the instruction set primarily 
from the compiler viewpoint. It starts with a review of the anatomy of current 
compilers. Next we discuss how compiler technology affects the decisions of the 
architect, and how the architect can make it hard or easy for the compiler to pro- 
duce good code. We conclude with a review of compilers and multimedia opera- 
tions, which unfortunately is a bad example of cooperation between compiler 
writers and architects. 


The Structure of Recent Compilers 


To begin, let's look at what optimizing compilers are like today. Figure B.19 
shows the structure of recent compilers. 
[Figure B.19 diagram of compiler passes. Surviving label for the final pass: Code generator; language independent; machine-dependent optimizations; may include or be followed by assembler.]


Figure B.19 Compilers typically consist of two to four passes, with more highly opti-
mizing compilers having more passes. This structure maximizes the probability that a
program compiled at various levels of optimization will produce the same output when 
given the same input. The optimizing passes are designed to be optional and may be 
skipped when faster compilation is the goal and lower-quality code is acceptable. A 
pass is simply one phase in which the compiler reads and transforms the entire pro- 
gram. (The term phase is often used interchangeably with pass.) Because the optimiz- 
ing passes are separated, multiple languages can use the same optimizing and code 
generation passes. Only a new front end is required for a new language. 


A compiler writer's first goal is correctness—all valid programs must be 
compiled correctly. The second goal is usually speed of the compiled code. Typi- 
cally, a whole set of other goals follows these two, including fast compilation, 
debugging support, and interoperability among languages. Normally, the passes 
in the compiler transform higher-level, more abstract representations into pro- 
gressively lower-level representations. Eventually it reaches the instruction set. 
This structure helps manage the complexity of the transformations and makes 
writing a bug-free compiler easier. 

The complexity of writing a correct compiler is a major limitation on the 
amount of optimization that can be done. Although the multiple-pass structure 
helps reduce compiler complexity, it also means that the compiler must order and 
perform some transformations before others. In the diagram of the optimizing 
compiler in Figure B.19, we can see that certain high-level optimizations are per- 
formed long before it is known what the resulting code will look like. Once such 
a transformation is made, the compiler can't afford to go back and revisit all 
steps, possibly undoing transformations. Such iteration would be prohibitive, 
both in compilation time and in complexity. Thus, compilers make assumptions 
about the ability of later steps to deal with certain problems. For example, com- 
pilers usually have to choose which procedure calls to expand inline before they 
know the exact size of the procedure being called. Compiler writers call this
problem the phase-ordering problem. 

How does this ordering of transformations interact with the instruction set 
architecture? A good example occurs with the optimization called global com- 
mon subexpression elimination. This optimization finds two instances of an 
expression that compute the same value and saves the value of the first computa- 
tion in a temporary. It then uses the temporary value, eliminating the second com- 
putation of the common expression. 

For this optimization to be significant, the temporary must be allocated to a 
register. Otherwise, the cost of storing the temporary in memory and later reload- 
ing it may negate the savings gained by not recomputing the expression. There 
are, in fact, cases where this optimization actually slows down code when the 
temporary is not register allocated. Phase ordering complicates this problem 
because register allocation is typically done near the end of the global optimiza- 
tion pass, just before code generation. Thus, an optimizer that performs this opti- 
mization must assume that the register allocator will allocate the temporary to a 
register. 
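
As a rough illustration of the transformation itself, here is a minimal sketch of common subexpression elimination over a straight-line sequence of three-address operations. It is the local form of the optimization, not the global one; the tuple representation, the "copy" pseudo-operation, and the function name are assumptions made for this example.

```python
def eliminate_common_subexpressions(code):
    """Rewrite (dest, op, left, right) tuples so repeated expressions become copies."""
    available = {}   # (op, left, right) -> name currently holding that value
    rewritten = []
    for dest, op, left, right in code:
        key = (op, left, right)
        if key in available:
            # Second instance: reuse the saved value instead of recomputing it.
            rewritten.append((dest, "copy", available[key], None))
        else:
            rewritten.append((dest, op, left, right))
            available[key] = dest
        # Writing dest invalidates expressions that read dest, and any
        # older expression whose saved value lived in dest.
        available = {
            expr: holder
            for expr, holder in available.items()
            if dest not in (expr[1], expr[2]) and (holder != dest or expr == key)
        }
    return rewritten


before = [
    ("t1", "+", "a", "b"),
    ("x", "*", "t1", "c"),
    ("t2", "+", "a", "b"),   # same expression as t1
    ("y", "*", "t2", "d"),
]
for line in eliminate_common_subexpressions(before):
    print(line)
```

The rewritten code keeps t1 alive from its definition until the copy into t2. If t1 ends up in memory rather than in a register, the copy turns into a store followed by a load, which is the case where the optimization can lose.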

Optimizations performed by modern compilers can be classified by the style 
of the transformation, as follows: 


• High-level optimizations are often done on the source with output fed to later
optimization passes.

• Local optimizations optimize code only within a straight-line code fragment
(called a basic block by compiler people).

• Global optimizations extend the local optimizations across branches and
introduce a set of transformations aimed at optimizing loops.

• Register allocation associates registers with operands.

• Processor-dependent optimizations attempt to take advantage of specific
architectural knowledge.


Register Allocation 


Because of the central role that register allocation plays, both in speeding up the 
code and in making other optimizations useful, it is one of the most important— 
if not the most important—of the optimizations. Register allocation algorithms 
today are based on a technique called graph coloring. The basic idea behind 
graph coloring is to construct a graph representing the possible candidates for 
allocation to a register and then to use the graph to allocate registers. Roughly 
speaking, the problem is how to use a limited set of colors so that no two adjacent 
nodes in a dependency graph have the same color. The emphasis in the approach 
is to achieve 100% register allocation of active variables. The problem of color- 
ing a graph in general can take exponential time as a function of the size of the 
graph (NP-complete). There are heuristic algorithms, however, that work well in
practice, yielding near-optimal allocations while running in near-linear time.
Graph coloring works best when there are at least 16 (and preferably more)
general-purpose registers available for global allocation for integer variables and 
additional registers for floating point. Unfortunately, graph coloring does not 
work very well when the number of registers is small because the heuristic algo- 
rithms for coloring the graph are likely to fail. 
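
The idea can be sketched with a simple greedy heuristic. This is not the algorithm used by any particular compiler; it only shows how a small register count forces some candidates out of registers. The function name, the edge-list input, and the highest-degree-first ordering are assumptions made for illustration.

```python
def color_interference_graph(edges, num_registers):
    """Greedy coloring: returns (register assignment, candidates left uncolored)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    assignment, uncolored = {}, []
    # Visit the most constrained candidates first (highest degree, then name).
    for node in sorted(neighbors, key=lambda n: (-len(neighbors[n]), n)):
        taken = {assignment[m] for m in neighbors[node] if m in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            uncolored.append(node)   # would have to live in memory
    return assignment, uncolored


# a, b, and c are live at the same time; d overlaps only with c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(color_interference_graph(edges, 3))
print(color_interference_graph(edges, 2))
```

With three registers every candidate is colored; with two, the heuristic has to leave one of the three mutually adjacent candidates out.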


Impact of Optimizations on Performance 


It is sometimes difficult to separate some of the simpler optimizations—local and 
processor-dependent optimizations—from transformations done in the code gen- 
erator. Examples of typical optimizations are given in Figure B.20. The last col- 
umn of Figure B.20 indicates the frequency with which the listed optimizing 
transforms were applied to the source program. 

Figure B.21 shows the effect of various optimizations on instructions exe- 
cuted for two programs. In this case, optimized programs executed roughly 25% 
to 90% fewer instructions than unoptimized programs. The figure illustrates the 
importance of looking at optimized code before suggesting new instruction set 
features, since a compiler might completely remove the instructions the architect 
was trying to improve. 
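
The range quoted above follows directly from the bar labels of Figure B.21. The short calculation below uses those percentages as read from the figure; the dictionary layout and function name are ours.

```python
# Percentage of unoptimized instructions executed, by optimization level,
# as labeled in Figure B.21.
executed = {
    "lucas": {0: 100, 1: 21, 2: 12, 3: 11},
    "mcf": {0: 100, 1: 84, 2: 76, 3: 76},
}


def percent_fewer(program, level):
    """Percentage of instructions removed relative to unoptimized code."""
    return executed[program][0] - executed[program][level]


for program in executed:
    print(program, [percent_fewer(program, level) for level in (1, 2, 3)])
```

At the highest level mcf executes about a quarter fewer instructions and lucas about nine-tenths fewer, which brackets the range given in the text.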


The Impact of Compiler Technology on the Architect's 
Decisions 


The interaction of compilers and high-level languages significantly affects how 
programs use an instruction set architecture. There are two important questions: 
How are variables allocated and addressed? How many registers are needed to 
allocate variables appropriately? To address these questions, we must look at the 
three separate areas in which current high-level languages allocate their data: 


• The stack is used to allocate local variables. The stack is grown or shrunk on
procedure call or return, respectively. Objects on the stack are addressed rela-
tive to the stack pointer and are primarily scalars (single variables) rather than
arrays. The stack is used for activation records, not as a stack for evaluating
expressions. Hence, values are almost never pushed or popped on the stack.

• The global data area is used to allocate statically declared objects, such as
global variables and constants. A large percentage of these objects are arrays
or other aggregate data structures.

• The heap is used to allocate dynamic objects that do not adhere to a stack dis-
cipline. Objects in the heap are accessed with pointers and are typically not
scalars.


Register allocation is much more effective for stack-allocated objects than for 
global variables, and register allocation is essentially impossible for heap- 
allocated objects because they are accessed with pointers. Global variables and 
some stack variables are impossible to allocate because they are aliased—there 
| Optimization name | Explanation | Percentage of the total number of optimizing transforms |
|---|---|---|
| High-level | At or near the source level; processor-independent | |
| Procedure integration | Replace procedure call by procedure body | N.M. |
| Local | Within straight-line code | |
| Common subexpression elimination | Replace two instances of the same computation by single copy | 18% |
| Constant propagation | Replace all instances of a variable that is assigned a constant with the constant | 22% |
| Stack height reduction | Rearrange expression tree to minimize resources needed for expression evaluation | N.M. |
| Global | Across a branch | |
| Global common subexpression elimination | Same as local, but this version crosses branches | 13% |
| Copy propagation | Replace all instances of a variable A that has been assigned X (i.e., A = X) with X | 11% |
| Code motion | Remove code from a loop that computes same value each iteration of the loop | 16% |
| Induction variable elimination | Simplify/eliminate array addressing calculations within loops | 2% |
| Processor-dependent | Depends on processor knowledge | |
| Strength reduction | Many examples, such as replace multiply by a constant with adds and shifts | N.M. |
| Pipeline scheduling | Reorder instructions to improve pipeline performance | N.M. |
| Branch offset optimization | Choose the shortest branch displacement that reaches target | N.M. |





Figure B.20 Major types of optimizations and examples in each class. These data tell us about the relative fre-
quency of occurrence of various optimizations. The third column lists the static frequency with which some of the
common optimizations are applied in a set of 12 small FORTRAN and Pascal programs. There are nine local and
global optimizations done by the compiler included in the measurement. Six of these optimizations are covered in the
figure, and the remaining three account for 18% of the total static occurrences. The abbreviation N.M. means that
the number of occurrences of that optimization was not measured. Processor-dependent optimizations are usually
done in a code generator, and none of those was measured in this experiment. The percentage is the portion of the
static optimizations that are of the specified type. Data from Chow [1983] (collected using the Stanford UCODE
compiler).


are multiple ways to refer to the address of a variable, making it illegal to put it 
into a register. (Most heap variables are effectively aliased for today's compiler 
technology.) 


For example, consider the following code sequence, where & returns the
address of a variable and * dereferences a pointer:
[Figure B.21 chart: one horizontal bar per program and optimization level, each broken down into branches/calls, floating-point ALU ops, loads-stores, and integer ALU ops. Horizontal axis: percentage of unoptimized instructions executed, 0% to 100%.]

| Program, compiler optimization level | Percentage of unoptimized instructions executed |
|---|---|
| lucas, level 3 | 11% |
| lucas, level 2 | 12% |
| lucas, level 1 | 21% |
| lucas, level 0 | 100% |
| mcf, level 3 | 76% |
| mcf, level 2 | 76% |
| mcf, level 1 | 84% |
| mcf, level 0 | 100% |


Figure B.21 Change in instruction count for the programs lucas and mcf from the 
SPEC2000 as compiler optimization levels vary. Level 0 is the same as unoptimized 
code. Level 1 includes local optimizations, code scheduling, and local register alloca- 
tion. Level 2 includes global optimizations, loop transformations (software pipelining), 
and global register allocation. Level 3 adds procedure integration. These experiments 
were performed on the Alpha compilers. 


p = &a     -- gets address of a in p
a = ...    -- assigns to a directly
*p = ...   -- uses p to assign to a
...a...    -- accesses a


The variable a could not be register allocated across the assignment to *p without 
generating incorrect code. Aliasing causes a substantial problem because it is 
often difficult or impossible to decide what objects a pointer may refer to. A 
compiler must be conservative; some compilers will not allocate any local vari- 
ables of a procedure in a register when there is a pointer that may refer to one of 
the local variables. 
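
The hazard can be imitated in a few lines. The sketch below models memory as a dictionary and a register as an ordinary local variable; this modeling, and every name in it, is an assumption made for illustration rather than a description of how a compiler represents either.

```python
memory = {"a": 0}          # the variable a lives in memory
p = "a"                    # p = &a: p holds the address (here, the key) of a

memory["a"] = 1            # a = ...    assigns to a directly
reg_a = memory["a"]        # the compiler decides to keep a in a register

memory[p] = 2              # *p = ...   assigns to a through the pointer
stale = reg_a              # ...a...    read from the register copy
fresh = memory["a"]        # ...a...    read from memory

print(stale, fresh)        # the register copy missed the store through p
```

The register copy and the memory location disagree after the store through p, which is why a compiler that cannot rule out the alias must keep a in memory across that assignment.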


How the Architect Can Help the Compiler Writer 


Today, the complexity of a compiler does not come from translating simple state-
ments like A = B + C. Most programs are locally simple, and simple translations
work fine. Rather, complexity arises because programs are large and globally 
complex in their interactions, and because the structure of compilers means deci- 
sions are made one step at a time about which code sequence is best. 

Compiler writers often are working under their own corollary of a basic prin- 
ciple in architecture: Make the frequent cases fast and the rare case correct. That 
is, if we know which cases are frequent and which are rare, and if generating 
code for both is straightforward, then the quality of the code for the rare case may
not be very important—but it must be correct! 

Some instruction set properties help the compiler writer. These properties 
should not be thought of as hard-and-fast rules, but rather as guidelines that will 
make it easier to write a compiler that will generate efficient and correct code. 


• Provide regularity—Whenever it makes sense, the three primary components
of an instruction set—the operations, the data types, and the addressing 
modes—should be orthogonal. Two aspects of an architecture are said to be 
orthogonal if they are independent. For example, the operations and address- 
ing modes are orthogonal if, for every operation to which one addressing 
mode can be applied, all addressing modes are applicable. This regularity 
helps simplify code generation and is particularly important when the deci- 
sion about what code to generate is split into two passes in the compiler. A 
good counterexample of this property is restricting what registers can be used 
for a certain class of instructions. Compilers for special-purpose register 
architectures typically get stuck in this dilemma. This restriction can result in 
the compiler finding itself with lots of available registers, but none of the 
right kind! 


• Provide primitives, not solutions—Special features that "match" a language
construct or a kernel function are often unusable. Attempts to support high- 
level languages may work only with one language, or do more or less than is 
required for a correct and efficient implementation of the language. An exam- 
ple of how such attempts have failed is given in Section B.10. 


• Simplify trade-offs among alternatives—One of the toughest jobs a compiler
writer has is figuring out what instruction sequence will be best for every seg- 
ment of code that arises. In earlier days, instruction counts or total code size 
might have been good metrics, but—as we saw in Chapter 1—this is no 
longer true. With caches and pipelining, the trade-offs have become very 
complex. Anything the designer can do to help the compiler writer under- 
stand the costs of alternative code sequences would help improve the code. 
One of the most difficult instances of complex trade-offs occurs in a register- 
memory architecture in deciding how many times a variable should be ref- 
erenced before it is cheaper to load it into a register. This threshold is hard to 
compute and, in fact, may vary among models of the same architecture. 


• Provide instructions that bind the quantities known at compile time as con-
stants—A compiler writer hates the thought of the processor interpreting at
run time a value that was known at compile time. Good counterexamples of
this principle include instructions that interpret values that were fixed at com-
pile time. For instance, the VAX procedure call instruction (calls) dynami-
cally interprets a mask saying what registers to save on a call, but the mask is
fixed at compile time (see Section B.10). 
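
The last guideline can be illustrated by contrasting a register-save mask that is decoded on every call with the same mask expanded once ahead of time. The 12-bit mask layout, the push notation, and the function names below are hypothetical and chosen only for this sketch; they are not the VAX encoding.

```python
def saves_interpreted_at_run_time(mask, num_registers=12):
    """Work repeated on every call: scan the mask bit by bit."""
    saved = []
    for reg in range(num_registers):
        if mask & (1 << reg):
            saved.append(f"push r{reg}")
    return saved


def saves_bound_at_compile_time(mask, num_registers=12):
    """Work done once by a compiler: emit fixed stores, leaving no mask to decode."""
    return tuple(f"push r{reg}" for reg in range(num_registers) if mask & (1 << reg))


MASK = 0b0000_0010_1100                        # known when the procedure is compiled
PROLOGUE = saves_bound_at_compile_time(MASK)   # expanded a single time

print(saves_interpreted_at_run_time(MASK))     # recomputed at each call
print(PROLOGUE)
```

Both forms save the same registers; the difference is whether the decoding work is paid on every call or once.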


Example 
Compiler Support (or Lack Thereof) for Multimedia
Instructions 


Alas, the designers of the SIMD instructions that operate on several narrow data 
items in a single clock cycle consciously ignored the previous subsection. These 
instructions tend to be solutions, not primitives; they are short of registers; and 
the data types do not match existing programming languages. Architects hoped to 
find an inexpensive solution that would help some users, but in reality, only a few 
low-level graphics library routines use them. 

The SIMD instructions are really an abbreviated version of an elegant architec- 
ture style that has its own compiler technology. As explained in Appendix F, vector 
architectures operate on vectors of data. Vector architectures were invented originally for scientific codes,
but multimedia kernels are often vectorizable as well, albeit often with shorter vectors.
Hence, we can think of Intel's MMX and SSE or PowerPC's AltiVec as simply 
short vector computers: MMX with vectors of eight 8-bit elements, four 16-bit ele- 
ments, or two 32-bit elements, and AltiVec with vectors twice that length. They are 
implemented as simply adjacent, narrow elements in wide registers. 

These microprocessor architectures build the vector register size into the 
architecture: the sum of the sizes of the elements is limited to 64 bits for MMX 
and 128 bits for AltiVec. When Intel decided to expand to 128-bit vectors, it 
added a whole new set of instructions, called Streaming SIMD Extension (SSE). 

A major advantage of vector computers is hiding latency of memory access 
by loading many elements at once and then overlapping execution with data 
transfer. The goal of vector addressing modes is to collect data scattered about 
memory, place them in a compact form so that they can be operated on effi- 
ciently, and then place the results back where they belong. 

Over the years traditional vector computers added strided addressing and 
gather/scatter addressing to increase the number of programs that can be vector- 
ized. Strided addressing skips a fixed number of words between each access, so 
sequential addressing is often called unit stride addressing. Gather and scatter 
find their addresses in another vector register: Think of it as register indirect 
addressing for vector computers. From a vector perspective, in contrast, these 
short-vector SIMD computers support only unit strided accesses: Memory 
accesses load or store all elements at once from a single wide memory location. 
Since the data for multimedia applications are often streams that start and end in 
memory, strided and gather/scatter addressing modes are essential to successful 
vectorization. 
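The three access patterns can be sketched in scalar C. This is only a model of which memory words each mode touches; the function names and the choice of 32-bit elements are assumptions made for illustration, not part of any vector instruction set.

```c
#include <stddef.h>
#include <stdint.h>

/* Unit stride: element i comes from consecutive words in memory. */
void load_unit_stride(int32_t *vreg, const int32_t *base, size_t n) {
    for (size_t i = 0; i < n; i++)
        vreg[i] = base[i];
}

/* Strided: consecutive elements are `stride` words apart in memory. */
void load_strided(int32_t *vreg, const int32_t *base, size_t stride, size_t n) {
    for (size_t i = 0; i < n; i++)
        vreg[i] = base[i * stride];
}

/* Gather: each element's location comes from an index vector
   (register indirect addressing, one address per element). */
void load_gather(int32_t *vreg, const int32_t *base, const size_t *index, size_t n) {
    for (size_t i = 0; i < n; i++)
        vreg[i] = base[index[i]];
}

/* Scatter: the store counterpart of gather. */
void store_scatter(int32_t *base, const size_t *index, const int32_t *vreg, size_t n) {
    for (size_t i = 0; i < n; i++)
        base[index[i]] = vreg[i];
}
```

A short-vector SIMD extension offers only the first pattern directly; the other two have to be built from extra unpack and shuffle instructions.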


As an example, compare a vector computer to MMX for color representation 
conversion of pixels from RGB (red green blue) to YUV (luminosity chromi- 
nance), with each pixel represented by 3 bytes. The conversion is just three lines 
of C code placed in a loop: 


Y = (9798*R + 19235*G + 3736*B) / 32768;
U = (-4784*R - 9487*G + 4221*B) / 32768 + 128;
V = (20218*R - 16941*G - 3277*B) / 32768 + 128;

A 64-bit-wide vector computer can calculate 8 pixels simultaneously. One vector
computer for media with strided addresses takes


• 3 vector loads (to get RGB)

• 3 vector multiplies (to convert R)

• 6 vector multiply adds (to convert G and B)

• 3 vector shifts (to divide by 32,768)

• 2 vector adds (to add 128)

• 3 vector stores (to store YUV)


The total is 20 instructions to perform the 20 operations in the previous C code to 
convert 8 pixels [Kozyrakis 2000]. (Since a vector might have 32 64-bit elements, 
this code actually converts up to 32 x 8 or 256 pixels.) 
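For reference, the scalar loop that these vector instructions replace might look like the following sketch. The function name, the packed 3-byte pixel layout in both arrays, and the clamp to the 0..255 byte range are assumptions added for illustration; the text does not say how out-of-range results are handled. Note that the R values of consecutive pixels sit 3 bytes apart, which is exactly the access pattern a strided vector load fetches in one instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to the byte range (an assumption; the text does not specify this). */
static uint8_t clamp_u8(int32_t x) {
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Convert n packed RGB pixels (3 bytes each) to packed YUV using the
   fixed-point coefficients given in the text. */
void rgb_to_yuv(const uint8_t *rgb, uint8_t *yuv, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t R = rgb[3 * i];       /* stride-3 accesses */
        int32_t G = rgb[3 * i + 1];
        int32_t B = rgb[3 * i + 2];
        int32_t Y = (9798 * R + 19235 * G + 3736 * B) / 32768;
        int32_t U = (-4784 * R - 9487 * G + 4221 * B) / 32768 + 128;
        int32_t V = (20218 * R - 16941 * G - 3277 * B) / 32768 + 128;
        yuv[3 * i]     = clamp_u8(Y);
        yuv[3 * i + 1] = clamp_u8(U);
        yuv[3 * i + 2] = clamp_u8(V);
    }
}
```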

In contrast, Intel's Web site shows that a library routine to perform the same 
calculation on 8 pixels takes 116 MMX instructions plus 6 80x86 instructions 
[Intel 2001]. This sixfold increase in instructions is due to the large number of 
instructions to load and unpack RGB pixels and to pack and store YUV pixels, 
since there are no strided memory accesses. 


Having short, architecture-limited vectors with few registers and simple 
memory addressing modes makes it more difficult to use vectorizing compiler 
technology. Another challenge is that no programming language (yet) has support 
for operations on these narrow data. Hence, these SIMD instructions are more
likely to be found in hand-coded libraries than in compiled code.


Summary: The Role of Compilers


This section leads to several recommendations. First, we expect a new instruction 
set architecture to have at least 16 general-purpose registers—not counting sepa- 
rate registers for floating-point numbers—to simplify allocation of registers using 
graph coloring. The advice on orthogonality suggests that all supported address- 
ing modes apply to all instructions that transfer data. Finally, the last three pieces 
of advice—provide primitives instead of solutions, simplify trade-offs between 
alternatives, don't bind constants at run time—all suggest that it is better to err on 
the side of simplicity. In other words, understand that less is more in the design of 
an instruction set. Alas, SIMD extensions are more an example of good market- 
ing than of outstanding achievement of hardware-software co-design. 


Putting It All Together: The MIPS Architecture


In this section we describe a simple 64-bit load-store architecture called MIPS. 
The instruction set architecture of MIPS and RISC relatives was based on
observations similar to those covered in the last sections. (In Section K.3 we discuss
how and why these architectures became popular.) Reviewing our expectations 
from each section, for desktop applications: 


• Section B.2: Use general-purpose registers with a load-store architecture.

• Section B.3: Support these addressing modes: displacement (with an address
  offset size of 12-16 bits), immediate (size 8-16 bits), and register indirect.

• Section B.4: Support these data sizes and types: 8-, 16-, 32-, and 64-bit
  integers and 64-bit IEEE 754 floating-point numbers.

• Section B.5: Support these simple instructions, since they will dominate the
  number of instructions executed: load, store, add, subtract, move
  register-register, and shift.

• Section B.6: Compare equal, compare not equal, compare less, branch (with
  a PC-relative address at least 8 bits long), jump, call, and return.

• Section B.7: Use fixed instruction encoding if interested in performance, and
  use variable instruction encoding if interested in code size.

• Section B.8: Provide at least 16 general-purpose registers, be sure all
  addressing modes apply to all data transfer instructions, and aim for a
  minimalist instruction set. This section didn't cover floating-point programs,
  but they often use separate floating-point registers. The justification is to
  increase the total number of registers without raising problems in the
  instruction format or in the speed of the general-purpose register file. This
  compromise, however, is not orthogonal.


We introduce MIPS by showing how it follows these recommendations. Like 
most recent computers, MIPS emphasizes 


• a simple load-store instruction set

• design for pipelining efficiency (discussed in Appendix A), including a fixed
  instruction set encoding

• efficiency as a compiler target


MIPS provides a good architectural model for study, not only because of the pop- 
ularity of this type of processor, but also because it is an easy architecture to 
understand. We will use this architecture again in Appendix A and in Chapters 2 
and 3, and it forms the basis for a number of exercises and programming projects. 

In the years since the first MIPS processor in 1985, there have been many
versions of MIPS (see Appendix J). We will use a subset of what is now called
MIPS64, which we will often abbreviate to just MIPS, but the full instruction set
is found in Appendix J.


Registers for MIPS


MIPS64 has 32 64-bit general-purpose registers (GPRs), named R0, R1, ..., R31.
GPRs are also sometimes known as integer registers. Additionally, there is a set
of 32 floating-point registers (FPRs), named F0, F1, ..., F31, which can hold 32
single-precision (32-bit) values or 32 double-precision (64-bit) values. (When 
holding one single-precision number, the other half of the FPR is unused.) Both 
single- and double-precision floating-point operations (32-bit and 64-bit) are pro- 
vided. MIPS also includes instructions that operate on two single-precision oper- 
ands in a single 64-bit floating-point register. 

The value of R0 is always 0. We shall see later how we can use this register to
synthesize a variety of useful operations from a simple instruction set. 

A few special registers can be transferred to and from the general-purpose 
registers. An example is the floating-point status register, used to hold informa- 
tion about the results of floating-point operations. There are also instructions for 
moving between an FPR and a GPR. 


Data Types for MIPS 


The data types are 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double 
words for integer data and 32-bit single precision and 64-bit double precision for 
floating point. Half words were added because they are found in languages like C
and are popular in some programs, such as operating systems, that are concerned
about the size of data structures. They will also become more popular if Unicode
becomes widely used. Single-precision floating-point operands were added for
similar reasons. (Remember the earlier warning that you should measure many
more programs before designing an instruction set.)

The MIPS64 operations work on 64-bit integers and 32- or 64-bit floating 
point. Bytes, half words, and words are loaded into the general-purpose registers 
with either zeros or the sign bit replicated to fill the 64 bits of the GPRs. Once 
loaded, they are operated on with the 64-bit integer operations. 
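The two ways of filling the upper bits of a 64-bit register can be sketched in C as follows. The function names are hypothetical, memory is modeled as a byte array, and the byte order is big-endian (most-significant byte at the lowest address), which is only one of the two modes MIPS supports. The narrowing casts rely on two's complement conversion, which is what practical compilers provide.

```c
#include <stdint.h>

/* Load byte: replicate the sign bit into the upper 56 bits. */
uint64_t load_byte(const uint8_t *mem, uint64_t addr) {
    return (uint64_t)(int64_t)(int8_t)mem[addr];
}

/* Load byte unsigned: fill the upper 56 bits with zeros. */
uint64_t load_byte_unsigned(const uint8_t *mem, uint64_t addr) {
    return mem[addr];
}

/* Load half word: two bytes concatenated, then sign-extended from bit 15. */
uint64_t load_half(const uint8_t *mem, uint64_t addr) {
    uint16_t h = (uint16_t)((mem[addr] << 8) | mem[addr + 1]);
    return (uint64_t)(int64_t)(int16_t)h;
}

/* Load word: four bytes concatenated, then sign-extended from bit 31. */
uint64_t load_word(const uint8_t *mem, uint64_t addr) {
    uint32_t w = ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
                 ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
    return (uint64_t)(int64_t)(int32_t)w;
}
```

Once a value has been widened this way, every later operation can treat it as an ordinary 64-bit integer, which is why no narrower ALU operations are needed.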


Addressing Modes for MIPS Data Transfers 


The only data addressing modes are immediate and displacement, both with 16- 
bit fields. Register indirect is accomplished simply by placing 0 in the 16-bit dis- 
placement field, and absolute addressing with a 16-bit field is accomplished by 
using register 0 as the base register. Embracing zero gives us four effective 
modes, although only two are supported in the architecture. 
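A small C model shows how one displacement calculation yields the extra modes. The helper names are hypothetical, and the sign extension of the 16-bit displacement is stated here as an assumption based on how MIPS treats it.

```c
#include <stdint.h>

/* Register file model: R0 always reads as zero. */
uint64_t read_gpr(const uint64_t regs[32], unsigned r) {
    return r == 0 ? 0 : regs[r];
}

/* Displacement addressing: base register plus sign-extended 16-bit field. */
uint64_t effective_address(const uint64_t regs[32], unsigned base, int16_t disp) {
    return read_gpr(regs, base) + (uint64_t)(int64_t)disp;
}
```

With this one function, `effective_address(regs, 2, 30)` is plain displacement addressing, `effective_address(regs, 2, 0)` behaves as register indirect, and `effective_address(regs, 0, 1000)` behaves as absolute addressing.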

MIPS memory is byte addressable with a 64-bit address. It has a mode bit that 
allows software to select either Big Endian or Little Endian. As it is a load-store 
architecture, all references between memory and either GPRs or FPRs are 
through loads or stores. Supporting the data types mentioned above, memory 
accesses involving GPRs can be to a byte, half word, word, or double word. The

I-type instruction

    Bits     6        5     5     16
    Field    Opcode   rs    rt    Immediate

    Encodes: Loads and stores of bytes, half words, words, double words.
             All immediates (rt ← rs op immediate).
             Conditional branch instructions (rs is register, rd unused).
             Jump register, jump and link register
             (rd = 0, rs = destination, immediate = 0).

R-type instruction

    Bits     6        5     5     5     5       6
    Field    Opcode   rs    rt    rd    shamt   funct

    Register-register ALU operations: rd ← rs funct rt.
    Function encodes the data path operation: Add, Sub, ...
    Read/write special registers and moves.

J-type instruction

    Bits     6        26
    Field    Opcode   Offset added to PC

    Jump and jump and link.
    Trap and return from exception.

Figure B.22 Instruction layout for MIPS. All instructions are encoded in one of three
types, with common fields in the same location in each format.


FPRs may be loaded and stored with single-precision or double-precision num- 
bers. All memory accesses must be aligned. 


MIPS Instruction Format 


Since MIPS has just two addressing modes, these can be encoded into the 
opcode. Following the advice on making the processor easy to pipeline and 
decode, all instructions are 32 bits with a 6-bit primary opcode. Figure B.22 
shows the instruction layout. These formats are simple while providing 16-bit 
fields for displacement addressing, immediate constants, or PC-relative branch 
addresses. 
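The three formats can be packed and unpacked with a few shifts and masks. The sketch below follows the field widths of Figure B.22; the function names and the field name `shamt` for the 5-bit shift-amount field are assumptions, and any opcode values passed in are up to the caller.

```c
#include <stdint.h>

/* I-type: opcode(6) | rs(5) | rt(5) | immediate(16) */
uint32_t encode_i(unsigned opcode, unsigned rs, unsigned rt, uint16_t imm) {
    return ((uint32_t)(opcode & 0x3F) << 26) | ((uint32_t)(rs & 0x1F) << 21) |
           ((uint32_t)(rt & 0x1F) << 16) | (uint32_t)imm;
}

/* R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6) */
uint32_t encode_r(unsigned opcode, unsigned rs, unsigned rt, unsigned rd,
                  unsigned shamt, unsigned funct) {
    return ((uint32_t)(opcode & 0x3F) << 26) | ((uint32_t)(rs & 0x1F) << 21) |
           ((uint32_t)(rt & 0x1F) << 16) | ((uint32_t)(rd & 0x1F) << 11) |
           ((uint32_t)(shamt & 0x1F) << 6) | (uint32_t)(funct & 0x3F);
}

/* J-type: opcode(6) | offset(26) */
uint32_t encode_j(unsigned opcode, uint32_t offset) {
    return ((uint32_t)(opcode & 0x3F) << 26) | (offset & 0x03FFFFFF);
}

/* The opcode, rs, and rt fields sit in the same place in every format
   that has them, so a decoder can extract them before knowing the type. */
unsigned field_opcode(uint32_t insn) { return insn >> 26; }
unsigned field_rs(uint32_t insn)     { return (insn >> 21) & 0x1F; }
unsigned field_rt(uint32_t insn)     { return (insn >> 16) & 0x1F; }
```

Keeping the common fields at fixed positions is what lets register reads start in parallel with opcode decoding in a pipeline.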

Appendix J shows a variant of MIPS, called MIPS16, which has 16-bit and
32-bit instructions to improve code density for embedded applications. We will 
stick to the traditional 32-bit format in this book. 


MIPS Operations 


MIPS supports the list of simple operations recommended above plus a few
others. There are four broad classes of instructions: loads and stores, ALU
operations, branches and jumps, and floating-point operations.

Any of the general-purpose or floating-point registers may be loaded or
stored, except that loading R0 has no effect. Figure B.23 gives examples of the
load and store instructions. Single-precision floating-point numbers occupy half a
floating-point register. Conversions between single and double precision must be
done explicitly. The floating-point format is IEEE 754 (see Appendix I). A list of
all the MIPS instructions in our subset appears in Figure B.26 (page B-40).


To understand these figures we need to introduce a few additional extensions
to our C description language used initially on page B-9:

• A subscript is appended to the symbol ← whenever the length of the datum
  being transferred might not be clear. Thus, ←_n means transfer an n-bit
  quantity. We use x, y ← z to indicate that z should be transferred to x and y.

• A subscript is used to indicate selection of a bit from a field. Bits are labeled
  from the most-significant bit starting at 0. The subscript may be a single digit
  (e.g., Regs[R4]_0 yields the sign bit of R4) or a subrange (e.g., Regs[R3]_56..63
  yields the least-significant byte of R3).

• The variable Mem, used as an array that stands for main memory, is indexed by
  a byte address and may transfer any number of bytes.

• A superscript is used to replicate a field (e.g., 0^48 yields a field of zeros of
  length 48 bits).

• The symbol ## is used to concatenate two fields and may appear on either
  side of a data transfer.

Example instruction   Instruction name      Meaning
LD   R1,30(R2)        Load double word      Regs[R1] ←_64 Mem[30+Regs[R2]]
LD   R1,1000(R0)      Load double word      Regs[R1] ←_64 Mem[1000+0]
LW   R1,60(R2)        Load word             Regs[R1] ←_64 (Mem[60+Regs[R2]]_0)^32 ## Mem[60+Regs[R2]]
LB   R1,40(R3)        Load byte             Regs[R1] ←_64 (Mem[40+Regs[R3]]_0)^56 ## Mem[40+Regs[R3]]
LBU  R1,40(R3)        Load byte unsigned    Regs[R1] ←_64 0^56 ## Mem[40+Regs[R3]]
LH   R1,40(R3)        Load half word        Regs[R1] ←_64 (Mem[40+Regs[R3]]_0)^48 ##
                                            Mem[40+Regs[R3]] ## Mem[41+Regs[R3]]
L.S  F0,50(R3)        Load FP single        Regs[F0] ←_64 Mem[50+Regs[R3]] ## 0^32
L.D  F0,50(R2)        Load FP double        Regs[F0] ←_64 Mem[50+Regs[R2]]
SD   R3,500(R4)       Store double word     Mem[500+Regs[R4]] ←_64 Regs[R3]
SW   R3,500(R4)       Store word            Mem[500+Regs[R4]] ←_32 Regs[R3]_32..63
S.S  F0,40(R3)        Store FP single       Mem[40+Regs[R3]] ←_32 Regs[F0]_0..31
S.D  F0,40(R3)        Store FP double       Mem[40+Regs[R3]] ←_64 Regs[F0]
SH   R3,502(R2)       Store half            Mem[502+Regs[R2]] ←_16 Regs[R3]_48..63
SB   R2,41(R3)        Store byte            Mem[41+Regs[R3]] ←_8 Regs[R2]_56..63





Figure B.23 The load and store instructions in MIPS. All use a single addressing mode and require that the
memory value be aligned. Of course, both loads and stores are available for all the data types shown.


Example instruction   Instruction name          Meaning
DADDU  R1,R2,R3       Add unsigned              Regs[R1] ← Regs[R2] + Regs[R3]
DADDIU R1,R2,#3       Add immediate unsigned    Regs[R1] ← Regs[R2] + 3
LUI    R1,#42         Load upper immediate      Regs[R1] ← 0^32 ## 42 ## 0^16
DSLL   R1,R2,#5       Shift left logical        Regs[R1] ← Regs[R2] << 5
SLT    R1,R2,R3       Set less than             if (Regs[R2] < Regs[R3])
                                                Regs[R1] ← 1 else Regs[R1] ← 0





Figure B.24 Examples of arithmetic/logical instructions on MIPS, both with and 
without immediates. 


As an example, assuming that R8 and R10 are 64-bit registers:

Regs[R10]_32..63 ←_32 (Mem[Regs[R8]]_0)^24 ## Mem[Regs[R8]]

means that the byte at the memory location addressed by the contents of register
R8 is sign-extended to form a 32-bit quantity that is stored into the lower half of
register R10. (The upper half of R10 is unchanged.)

All ALU instructions are register-register instructions. Figure B.24 gives 
some examples of the arithmetic/logical instructions. The operations include sim- 
ple arithmetic and logical operations: add, subtract, AND, OR, XOR, and shifts. 
Immediate forms of all these instructions are provided using a 16-bit sign- 
extended immediate. The operation LUI (load upper immediate) loads bits 32 
through 47 of a register, while setting the rest of the register to 0. LUI allows a 32- 
bit constant to be built in two instructions, or a data transfer using any constant 
32-bit address in one extra instruction. 
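The two-instruction constant sequence can be modeled as below. The `lui` function follows the description above (immediate placed in bits 32 through 47, counting from the most-significant bit, with zeros elsewhere). Pairing it with an OR immediate whose 16-bit field is zero-extended is an assumption taken from how the MIPS ORI instruction is documented; with a sign-extended immediate the upper half would need adjusting first. All function names are hypothetical.

```c
#include <stdint.h>

/* LUI: place the 16-bit immediate in bits 32..47 (MSB = bit 0), zeros elsewhere. */
uint64_t lui(uint16_t imm) {
    return (uint64_t)imm << 16;
}

/* ORI with a zero-extended 16-bit immediate (assumed behavior). */
uint64_t ori(uint64_t rs, uint16_t imm) {
    return rs | (uint64_t)imm;
}

/* Build any 32-bit constant in two steps: upper half, then lower half. */
uint64_t build_const32(uint32_t c) {
    uint64_t r = lui((uint16_t)(c >> 16));
    return ori(r, (uint16_t)(c & 0xFFFF));
}
```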

As mentioned above, R0 is used to synthesize popular operations. Loading a
constant is simply an add immediate where the source operand is R0, and a
register-register move is simply an add where one of the sources is R0. (We
sometimes use the mnemonic LI, standing for load immediate, to represent the
former, and the mnemonic MOV for the latter.)
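A tiny register-file model makes the synthesis concrete. The `Gprs` type and the helper names are hypothetical; the only architectural facts used are that R0 reads as zero, that writes to it have no effect, and that the two pseudo-operations are ordinary adds.

```c
#include <stdint.h>

typedef struct { uint64_t r[32]; } Gprs;

/* R0 always reads as zero, and writes to it are discarded. */
uint64_t gpr_read(const Gprs *g, unsigned i) { return i == 0 ? 0 : g->r[i]; }
void gpr_write(Gprs *g, unsigned i, uint64_t v) { if (i != 0) g->r[i] = v; }

/* Add immediate: rt = rs + sign-extended 16-bit immediate. */
void daddiu(Gprs *g, unsigned rt, unsigned rs, int16_t imm) {
    gpr_write(g, rt, gpr_read(g, rs) + (uint64_t)(int64_t)imm);
}

/* Register-register add: rd = rs + rt. */
void daddu(Gprs *g, unsigned rd, unsigned rs, unsigned rt) {
    gpr_write(g, rd, gpr_read(g, rs) + gpr_read(g, rt));
}

/* Load immediate is an add immediate with R0 as the source. */
void li(Gprs *g, unsigned rt, int16_t imm) { daddiu(g, rt, 0, imm); }

/* Register move is an add with R0 as one of the sources. */
void mov(Gprs *g, unsigned rd, unsigned rs) { daddu(g, rd, rs, 0); }
```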


MIPS Control Flow Instructions 


MIPS provides compare instructions, which compare two registers to see if the 
first is less than the second. If the condition is true, these instructions place a 1 in 
the destination register (to represent true); otherwise they place the value 0. 
Because these operations "set" a register, they are called set-equal, set-not-equal, 
set-less-than, and so on. There are also immediate forms of these compares. 
Control is handled through a set of jumps and a set of branches. Figure B.25 
gives some typical branch and jump instructions. The four jump instructions are 
differentiated by the two ways to specify the destination address and by whether 
or not a link is made. Two jumps use a 26-bit offset shifted 2 bits and then replace

Example instruction   Instruction name           Meaning
J    name             Jump                       PC_36..63 ← name
JAL  name             Jump and link              Regs[R31] ← PC+8; PC_36..63 ← name;
                                                 ((PC+4)-2^27) ≤ name < ((PC+4)+2^27)
JALR R2               Jump and link register     Regs[R31] ← PC+8; PC ← Regs[R2]
JR   R3               Jump register              PC ← Regs[R3]
BEQZ R4,name          Branch equal zero          if (Regs[R4]==0) PC ← name;
                                                 ((PC+4)-2^17) ≤ name < ((PC+4)+2^17)
BNE  R3,R4,name       Branch not equal           if (Regs[R3]!=Regs[R4]) PC ← name;
                                                 ((PC+4)-2^17) ≤ name < ((PC+4)+2^17)
MOVZ R1,R2,R3         Conditional move if zero   if (Regs[R3]==0) Regs[R1] ← Regs[R2]











Figure B.25 Typical control flow instructions in MIPS. All control instructions, except 
jumps to an address in a register, are PC-relative. Note that the branch distances are 
longer than the address field would suggest; since MIPS instructions are all 32 bits long, 
the byte branch address is multiplied by 4 to get a longer distance. 


the lower 28 bits of the program counter (of the instruction sequentially follow- 
ing the jump) to determine the destination address. The other two jump instruc- 
tions specify a register that contains the destination address. There are two flavors 
of jumps: plain jump and jump and link (used for procedure calls). The latter 
places the return address—the address of the next sequential instruction—in R31. 

All branches are conditional. The branch condition is specified by the 
instruction, which may test the register source for zero or nonzero; the register 
may contain a data value or the result of a compare. There are also conditional 
branch instructions to test for whether a register is negative and for equality 
between two registers. The branch-target address is specified with a 16-bit signed 
offset that is shifted left two places and then added to the program counter, which 
is pointing to the next sequential instruction. There is also a branch to test the 
floating-point status register for floating-point conditional branches, described 
later. 
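The two target calculations described in this section can be written out as follows. The function names are hypothetical, and `pc` stands for the address of the branch or jump instruction itself, so that `pc + 4` is the next sequential instruction.

```c
#include <stdint.h>

/* Branch: 16-bit signed offset, shifted left 2 places, added to PC + 4. */
uint64_t branch_target(uint64_t pc, int16_t offset) {
    return pc + 4 + (uint64_t)((int64_t)offset * 4);
}

/* Jump: 26-bit offset, shifted left 2 places, replaces the lower 28 bits
   of the address of the sequentially following instruction. */
uint64_t jump_target(uint64_t pc, uint32_t offset26) {
    uint64_t next = pc + 4;
    uint64_t low28 = (uint64_t)(offset26 & 0x03FFFFFF) << 2;
    return (next & ~(uint64_t)0x0FFFFFFF) | low28;
}
```

Because the offset counts instructions rather than bytes, a 16-bit field reaches about 2^17 bytes in either direction, which matches the range given in Figure B.25.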

Appendix A and Chapter 2 show that conditional branches are a major chal- 
lenge to pipelined execution; hence many architectures have added instructions to 
convert a simple branch into a conditional arithmetic instruction. MIPS included 
conditional move on zero or not zero. The value of the destination register either 
is left unchanged or is replaced by a copy of one of the source registers depend- 
ing on whether or not the value of the other source register is zero. 
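As a sketch of how a conditional move removes a branch, the code below models MOVZ and MOVN as functions and uses a compare followed by MOVN to compute a maximum. The instruction sequence in the comments is illustrative only, not compiler output, and the function names are hypothetical.

```c
#include <stdint.h>

/* MOVZ rd, rs, rt: rd gets rs if rt is zero, otherwise rd is unchanged. */
uint64_t movz(uint64_t rd, uint64_t rs, uint64_t rt) {
    return rt == 0 ? rs : rd;
}

/* MOVN rd, rs, rt: rd gets rs if rt is not zero, otherwise rd is unchanged. */
uint64_t movn(uint64_t rd, uint64_t rs, uint64_t rt) {
    return rt != 0 ? rs : rd;
}

/* max(a, b) with no branch in the instruction stream. */
int64_t max_branchless(int64_t a, int64_t b) {
    uint64_t t = (uint64_t)(a < b);           /* SLT  t, a, b       */
    uint64_t result = (uint64_t)a;            /* MOV  result, a     */
    result = movn(result, (uint64_t)b, t);    /* MOVN result, b, t  */
    return (int64_t)result;
}
```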


MIPS Floating-Point Operations 


Floating-point instructions manipulate the floating-point registers and indicate 
whether the operation to be performed is single or double precision. The
operations MOV.S and MOV.D copy a single-precision (MOV.S) or double-precision
(MOV.D) floating-point register to another register of the same type. The
operations MFC1, MTC1, DMFC1, DMTC1 move data between a single or double
floating-point register and an integer register. Conversions from integer to
floating point are also provided, and vice versa.

The floating-point operations are add, subtract, multiply, and divide; a suffix 
D is used for double precision, and a suffix S is used for single precision (e.g., 
ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D, DIV.S). Floating-point
compares set a bit in the special floating-point status register that can be tested
with a pair of branches: BC1T and BC1F, branch floating-point true and branch
floating-point false. 

To get greater performance for graphics routines, MIPS64 has instructions 
that perform two 32-bit floating-point operations on each half of the 64-bit 
floating-point register. These paired single operations include ADD.PS, SUB.PS,
MUL.PS, and DIV.PS. (They are loaded and stored using double-precision loads
and stores.) 
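A paired-single add can be modeled by treating a 64-bit register image as two independent 32-bit floats. The function names are hypothetical, and which half is called upper or lower is an arbitrary choice made for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Pack two single-precision values into one 64-bit register image. */
uint64_t pack_ps(float upper, float lower) {
    uint32_t u, l;
    memcpy(&u, &upper, sizeof u);
    memcpy(&l, &lower, sizeof l);
    return ((uint64_t)u << 32) | (uint64_t)l;
}

/* Split a 64-bit register image back into its two floats. */
void unpack_ps(uint64_t reg, float *upper, float *lower) {
    uint32_t u = (uint32_t)(reg >> 32);
    uint32_t l = (uint32_t)reg;
    memcpy(upper, &u, sizeof u);
    memcpy(lower, &l, sizeof l);
}

/* Paired-single add: each half is added independently of the other. */
uint64_t add_ps(uint64_t a, uint64_t b) {
    float au, al, bu, bl;
    unpack_ps(a, &au, &al);
    unpack_ps(b, &bu, &bl);
    return pack_ps(au + bu, al + bl);
}
```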

Giving a nod toward the importance of multimedia applications, MIPS64 also 
includes both integer and floating-point multiply-add instructions: MADD, MADD.S,
MADD.D, and MADD.PS. The registers are all the same width in these combined
operations. Figure B.26 contains a list of a subset of MIPS64 operations and their 
meaning. 


MIPS Instruction Set Usage 


To give an idea which instructions are popular, Figure B.27 shows the frequency 
of instructions and instruction classes for five SPECint2000 programs, and Figure 
B.28 shows the same data for five SPECfp2000 programs. 


Fallacies and Pitfalls 


Architects have repeatedly tripped on common, but erroneous, beliefs. In this 
section we look at a few of them. 


Pitfall Designing a "high-level" instruction set feature specifically oriented to
supporting a high-level language structure.


Attempts to incorporate high-level language features in the instruction set have 
led architects to provide powerful instructions with a wide range of flexibility. 
However, often these instructions do more work than is required in the frequent 
case, or they don't exactly match the requirements of some languages. Many 
such efforts have been aimed at eliminating what in the 1970s was called the 
semantic gap. Although the idea is to supplement the instruction set with

Instruction type/opcode           Instruction meaning

Data transfers                    Move data between registers and memory, or between the integer and FP or
                                  special registers; only memory address mode is 16-bit displacement +
                                  contents of a GPR
  LB, LBU, SB                     Load byte, load byte unsigned, store byte (to/from integer registers)
  LH, LHU, SH                     Load half word, load half word unsigned, store half word (to/from integer
                                  registers)
  LW, LWU, SW                     Load word, load word unsigned, store word (to/from integer registers)
  LD, SD                          Load double word, store double word (to/from integer registers)
  L.S, L.D, S.S, S.D              Load SP float, load DP float, store SP float, store DP float
  MFC0, MTC0                      Copy from/to GPR to/from a special register
  MOV.S, MOV.D                    Copy one SP or DP FP register to another FP register
  MFC1, MTC1                      Copy 32 bits to/from FP registers from/to integer registers

Arithmetic/logical                Operations on integer or logical data in GPRs; signed arithmetic trap on
                                  overflow
  DADD, DADDI, DADDU, DADDIU      Add, add immediate (all immediates are 16 bits); signed and unsigned
  DSUB, DSUBU                     Subtract; signed and unsigned
  DMUL, DMULU, DDIV,              Multiply and divide, signed and unsigned; multiply-add; all operations
  DDIVU, MADD                     take and yield 64-bit values
  AND, ANDI                       And, and immediate
  OR, ORI, XOR, XORI              Or, or immediate, exclusive or, exclusive or immediate
  LUI                             Load upper immediate; loads bits 32 to 47 of register with immediate,
                                  then sign-extends
  DSLL, DSRL, DSRA, DSLLV,        Shifts: both immediate (DS__) and variable form (DS__V); shifts are shift
  DSRLV, DSRAV                    left logical, right logical, right arithmetic
  SLT, SLTI, SLTU, SLTIU          Set less than, set less than immediate; signed and unsigned

Control                           Conditional branches and jumps; PC-relative or through register
  BEQZ, BNEZ                      Branch GPRs equal/not equal to zero; 16-bit offset from PC + 4
  BEQ, BNE                        Branch GPR equal/not equal; 16-bit offset from PC + 4
  BC1T, BC1F                      Test comparison bit in the FP status register and branch; 16-bit offset
                                  from PC + 4
  MOVN, MOVZ                      Copy GPR to another GPR if third GPR is nonzero (MOVN) or zero (MOVZ)
  J, JR                           Jumps: 26-bit offset from PC + 4 (J) or target in register (JR)
  JAL, JALR                       Jump and link: save PC + 4 in R31, target is PC-relative (JAL) or a
                                  register (JALR)
  TRAP                            Transfer to operating system at a vectored address
  ERET                            Return to user code from an exception; restore user mode

Floating point                    FP operations on DP and SP formats
  ADD.D, ADD.S, ADD.PS            Add DP, SP numbers, and pairs of SP numbers
  SUB.D, SUB.S, SUB.PS            Subtract DP, SP numbers, and pairs of SP numbers
  MUL.D, MUL.S, MUL.PS            Multiply DP, SP floating point, and pairs of SP numbers
  MADD.D, MADD.S, MADD.PS         Multiply-add DP, SP numbers and pairs of SP numbers
  DIV.D, DIV.S, DIV.PS            Divide DP, SP floating point, and pairs of SP numbers
  CVT._._                         Convert instructions: CVT.x.y converts from type x to type y, where x and
                                  y are L (64-bit integer), W (32-bit integer), D (DP), or S (SP). Both
                                  operands are FPRs.
  C.__.D, C.__.S                  DP and SP compares: "__" = LT, GT, LE, GE, EQ, NE; sets bit in FP status
                                  register




Figure B.26 Subset of the instructions in MIPS64. Figure B.22 lists the formats of these instructions. SP = single
precision; DP = double precision. This list can also be found on the back inside cover.


Instruction gap gcc gzip mcf perlbmk Integer average
load 26.5% 25.1% 20.1% 30.3% 28.7% 26%
store 10.3% 13.2% 5.1% 4.3% 16.2% 10%
add 21.1% 19.0% 26.9% 10.1% 16.7% 19%
sub 1.7% 2.2% 5.1% 3.7% 2.5% 3%
mul 1.4% 0.1% 0%
compare 2.8% 6.1% 6.6% 6.3% 3.8% 5%
load imm 4.8% 2.5% 1.5% 0.1% 1.7% 2%
cond branch 9.3% 12.1% 11.0% 17.5% 10.9% 12%
cond move 0.4% 0.6% 1.1% 0.1% 1.9% 1%
jump 0.8% 0.7% 0.8% 0.7% 1.7% 1%
call 1.6% 0.6% 0.4% 3.2% 1.1% 1%
return 1.6% 0.6% 0.4% 3.2% 1.1% 1%
shift 3.8% 1.1% 2.1% 1.1% 0.5% 2%
and 4.3% 4.6% 9.4% 0.2% 1.2% 4%
or 7.9% 8.5% 4.8% 17.6% 8.7% 9%
xor 1.8% 2.1% 4.4% 1.5% 2.8% 3%
other logical 0.1% 0.4% 0.1% 0.1% 0.3% 0%
load FP 0%
store FP 0%
add FP 0%
sub FP 0%
mul FP 0%
div FP 0%
mov reg-reg FP 0%
compare FP 0%
cond mov FP 0%
other FP 0%





Figure B.27 MIPS dynamic instruction mix for five SPECint2000 programs. Note that integer register-register 
move instructions are included in the or instruction. Blank entries have the value 0.0%. 


additions that bring the hardware up to the level of the language, the additions 
can generate what Wulf [1981] has called a semantic clash: 


... by giving too much semantic content to the instruction, the computer designer
made it possible to use the instruction only in limited contexts. [p. 43]


More often the instructions are simply overkill—they are too general for the 
most frequent case, resulting in unneeded work and a slower instruction. Again, 
the VAX CALLS is a good example. CALLS uses a callee save strategy (the registers

Instruction applu art equake lucas swim FP average
load 13.8% 18.1% 22.3% 10.6% 9.1% 15%
store 2.9% 0.8% 3.4% 1.3% 2%
add 30.4% 30.1% 17.4% 11.1% 24.4% 23%
sub 2.5% 0.1% 2.1% 3.8% 2%
mul 2.3% 1.2% 1%
compare 7.4% 2.1% 2%
load imm 13.7% 1.0% 1.8% 9.4% 5%
cond branch 2.5% 11.5% 2.9% 0.6% 1.3% 4%
cond mov 0.3% 0.1% 0%
jump 0.1% 0%
call 0.7% 0%
return 0.7% 0%
shift 0.7% 0.2% 1.9% 1%
and 0.2% 1.8% 0%
or 0.8% 1.1% 2.3% 1.0% 7.2% 2%
xor 3.2% 0.1% 1%
other logical 0.1% 0%
load FP 11.4% 12.0% 19.7% 16.2% 16.8% 15%
store FP 4.2% 4.5% 2.7% 18.2% 5.0% 7%
add FP 2.3% 4.5% 9.8% 8.2% 9.0% 7%
sub FP 2.9% 1.3% 7.6% 4.7% 3%
mul FP 8.6% 4.1% 12.9% 9.4% 6.9% 8%
div FP 0.3% 0.6% 0.5% 0.3% 0%
mov reg-reg FP 0.7% 0.9% 1.2% 1.8% 0.9% 1%
compare FP 0.9% 0.6% 0.8% 0%
cond mov FP 0.6% 0.8% 0%
other FP 1.6% 0%





Figure B.28 MIPS dynamic instruction mix for five programs from SPECfp2000. Note that integer register-register
move instructions are included in the or instruction. Blank entries have the value 0.0%.


to be saved are specified by the callee), but the saving is done by the call
instruction in the caller. The CALLS instruction begins with the arguments pushed
on the stack, and then takes the following steps:


1. Align the stack if needed.

2. Push the argument count on the stack.

3. Save the registers indicated by the procedure call mask on the stack (as
   mentioned in Section B.8). The mask is kept in the called procedure's code;
   this permits the callee to specify the registers to be saved by the caller even
   with separate compilation.

4. Push the return address on the stack, and then push the top and base of stack
   pointers (for the activation record).

5. Clear the condition codes, which sets the trap enable to a known state.

6. Push a word for status information and a zero word on the stack.

7. Update the two stack pointers.

8. Branch to the first instruction of the procedure.


The vast majority of calls in real programs do not require this amount of over- 
head. Most procedures know their argument counts, and a much faster linkage 
convention can be established using registers to pass arguments rather than the 
stack in memory. Furthermore, the CALLS instruction forces two registers to be 
used for linkage, while many languages require only one linkage register. Many 
attempts to support procedure call and activation stack management have failed 
to be useful, either because they do not match the language needs or because they 
are too general and hence too expensive to use. 

The VAX designers provided a simpler instruction, JSB, that is much faster 
since it only pushes the return PC on the stack and jumps to the procedure. 
However, most VAX compilers use the more costly CALLS instructions. The call 
instructions were included in the architecture to standardize the procedure link- 
age convention. Other computers have standardized their calling convention by 
agreement among compiler writers and without requiring the overhead of a com- 
plex, very general procedure call instruction. 
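A rough count of stack words written makes the difference between the two linkage instructions visible. This sketch simply tallies the pushes named in the steps listed for CALLS (argument count, saved registers, return address, two stack pointers, a status word, and a zero word) against the single return PC pushed by JSB. It ignores stack alignment, and the 16-bit mask type and function names are assumptions for illustration.

```c
#include <stdint.h>

/* Number of registers selected by the procedure call mask. */
unsigned saved_registers(uint16_t mask) {
    unsigned n = 0;
    for (; mask != 0; mask >>= 1)
        n += mask & 1u;
    return n;
}

/* Words pushed by CALLS, following the listed steps. */
unsigned calls_words_pushed(uint16_t mask) {
    return 1                        /* argument count             */
         + saved_registers(mask)    /* registers named in mask    */
         + 1                        /* return address             */
         + 2                        /* top and base stack pointers */
         + 2;                       /* status word and zero word  */
}

/* Words pushed by JSB: only the return PC. */
unsigned jsb_words_pushed(void) {
    return 1;
}
```

Even a call that saves no registers writes six words under this tally, compared with one for JSB.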


Fallacy There is such a thing as a typical program. 


Many people would like to believe that there is a single "typical" program that
could be used to design an optimal instruction set. For example, see the synthetic
benchmarks discussed in Chapter 1. The data in this appendix clearly show that
programs can vary significantly in how they use an instruction set. For example,
Figure B.29 shows the mix of data transfer sizes for four of the SPEC2000
programs: It would be hard to say what is typical from these four programs. The
variations are even larger on an instruction set that supports a class of
applications, such as decimal instructions, that are unused by other applications.


Pitfall Innovating at the instruction set architecture to reduce code size without
accounting for the compiler.


Figure B.30 shows the relative code sizes for four compilers for the MIPS 
instruction set. Whereas architects struggle to reduce code size by 30% to 40%, 
different compiler strategies can change code size by much larger factors. Similar 
to performance optimization techniques, the architect should start with the tight- 
est code the compilers can produce before proposing hardware innovations to 
save space.

Figure B.29 Data reference size of four programs from SPEC2000. Although you can
calculate an average size, it would be hard to claim the average is typical of programs.





























                                    Apogee Software   Green Hills Multi2000   Algorithmics   IDT/c
Compiler                            Version 4.1       Version 2.0             SDE4.0B        7.2.1
Architecture                        MIPS IV           MIPS IV                 MIPS 32        MIPS 32
Processor                           NEC VR5432        NEC VR5000              IDT 32334      IDT 79RC32364
Autocorrelation kernel              1.0               2.1                     1.1            2.7
Convolutional encoder kernel        1.0               1.9                     1.2            2.4
Fixed-point bit allocation kernel   1.0               2.0                     1.2            2.3
Fixed-point complex FFT kernel      1.0               1.1                     2.             1.8
Viterbi GSM decoder kernel          1.0               1.7                     0.8            1.1
Geometric mean of five kernels      1.0               1.7                     1.4            2.0





Figure B.30 Code size relative to Apogee Software Version 4.1 C compiler for Telecom application of EEMBC 
benchmarks. The instruction set architectures are virtually identical, yet the code sizes vary by factors of 2. These 
results were reported February-June 2000. 


Fallacy An architecture with flaws cannot be successful. 


The 80x86 provides a dramatic example: The instruction set architecture is one 
only its creators could love (see Appendix J). Succeeding generations of Intel 
engineers have tried to correct unpopular architectural decisions made in design- 
ing the 80x86. For example, the 80x86 supports segmentation, whereas all others 
picked paging; it uses extended accumulators for integer data, but other proces- 
sors use general-purpose registers, and it uses a stack for floating-point data, 
when everyone else abandoned execution stacks long before. 

Despite these major difficulties, the 80x86 architecture has been enormously
successful. The reasons are threefold. First, its selection as the microprocessor in
the initial IBM PC makes 80x86 binary compatibility extremely valuable. Second,
Moore's Law provided sufficient resources for 80x86 microprocessors to
translate to an internal RISC instruction set and then execute RISC-like
instructions. This mix enables binary compatibility with the valuable PC software base
and performance on par with RISC processors. Third, the very high volumes of
PC microprocessors mean Intel can easily pay for the increased design cost of
hardware translation. In addition, the high volumes allow the manufacturer to go
up the learning curve, which lowers the cost of the product.

The larger die size and increased power for translation may be a liability for
embedded applications, but it makes tremendous economic sense for the desktop.
Its cost-performance in the desktop also makes it attractive for servers. Its
main weakness for servers was 32-bit addresses, which was resolved with
the 64-bit addresses of AMD64 (see Chapter 5).


Fallacy You can design a flawless architecture. 


All architecture design involves trade-offs made in the context of a set of hard- 
ware and software technologies. Over time those technologies are likely to 
change, and decisions that may have been correct at the time they were made 
look like mistakes. For example, in 1975 the VAX designers overemphasized the 
importance of code size efficiency, underestimating how important ease of 
decoding and pipelining would be five years later. An example in the RISC camp 
is delayed branch (see Appendix J). It was a simple matter to control pipeline 
hazards with five-stage pipelines, but a challenge for processors with longer pipe- 
lines that issue multiple instructions per clock cycle. In addition, almost all archi- 
tectures eventually succumb to the lack of sufficient address space. 

In general, avoiding such flaws in the long run would probably mean compro- 
mising the efficiency of the architecture in the short run, which is dangerous, 
since a new instruction set architecture must struggle to survive its first few years. 


B.11 Concluding Remarks 


The earliest architectures were limited in their instruction sets by the hardware 
technology of that time. As soon as the hardware technology permitted, computer
architects began looking for ways to support high-level languages. This search 
led to three distinct periods of thought about how to support programs efficiently. 
In the 1960s, stack architectures became popular. They were viewed as being a 
good match for high-level languages—and they probably were, given the com- 
piler technology of the day. In the 1970s, the main concern of architects was how 
to reduce software costs. This concern was met primarily by replacing software 
with hardware, or by providing high-level architectures that could simplify the 
task of software designers. The result was both the high-level language computer 
architecture movement and powerful architectures like the VAX, which has a 
large number of addressing modes, multiple data types, and a highly orthogonal 
architecture. In the 1980s, more sophisticated compiler technology and a 
renewed emphasis on processor performance saw a return to simpler architectures,
based mainly on the load-store style of computer.
The following instruction set architecture changes occurred in the 1990s: 


■ Address size doubles—The 32-bit address instruction sets for most desktop
  and server processors were extended to 64-bit addresses, expanding the width
  of the registers (among other things) to 64 bits. Appendix J gives three
  examples of architectures that have gone from 32 bits to 64 bits.

■ Optimization of conditional branches via conditional execution—In Chapters
  2 and 3, we see that conditional branches can limit the performance of
  aggressive computer designs. Hence, there was interest in replacing
  conditional branches with conditional completion of operations, such as
  conditional move (see Appendix G), which was added to most instruction sets.

■ Optimization of cache performance via prefetch—Chapter 5 explains the
  increasing role of memory hierarchy in performance of computers, with a
  cache miss on some computers taking as many instruction times as page
  faults took on earlier computers. Hence, prefetch instructions were added to
  try to hide the cost of cache misses by prefetching (see Chapter 5).

■ Support for multimedia—Most desktop and embedded instruction sets were
  extended with support for multimedia applications.

■ Faster floating-point operations—Appendix I describes operations added to
  enhance floating-point performance, such as operations that perform a
  multiply and an add and paired single execution. (We include them in MIPS.)


Between 1970 and 1985 many thought the primary job of the computer archi- 
tect was the design of instruction sets. As a result, textbooks of that era empha- 
size instruction set design, much as computer architecture textbooks of the 1950s 
and 1960s emphasized computer arithmetic. The educated architect was expected 
to have strong opinions about the strengths and especially the weaknesses of the 
popular computers. The importance of binary compatibility in quashing innova- 
tions in instruction set design was unappreciated by many researchers and text- 
book writers, giving the impression that many architects would get a chance to 
design an instruction set. 

The definition of computer architecture today has been expanded to include 
design and evaluation of the full computer system—not just the definition of the 
instruction set and not just the processor—and hence there are plenty of topics 
for the architect to study. In fact, the material in this appendix was a central point 
of the book in its first edition in 1990, but now is included in an appendix prima- 
rily as reference material! 
Appendix J may satisfy readers interested in instruction set architecture: it
describes a variety of instruction sets, which are either important in the market- 
place today or historically important, and compares nine popular load-store com- 
puters with MIPS. 


Historical Perspective and References 


Section K.3 (available on the companion CD) features a discussion on the evolu- 
tion of instruction sets and includes references for further reading and exploration 
of related topics. 


C.1  Introduction
C.2  Cache Performance
C.3  Six Basic Cache Optimizations
C.4  Virtual Memory
C.5  Protection and Examples of Virtual Memory
C.6  Fallacies and Pitfalls
C.7  Concluding Remarks
C.8  Historical Perspective and References


Review of Memory Hierarchy 


Cache: a safe place for hiding or storing things. 


Webster's New World Dictionary of the 
American Language 
Second College Edition (1976) 
C.1 Introduction


This appendix is a quick refresher of the memory hierarchy, including the basics 
of cache and virtual memory, performance equations, and simple optimizations. 
This first section reviews the following 36 terms: 


cache                        fully associative      write allocate
virtual memory               dirty bit              unified cache
memory stall cycles          block offset           misses per instruction
direct mapped                write back             block
valid bit                    data cache             locality
block address                hit time               address trace
write through                cache miss             set
instruction cache            page fault             random replacement
average memory access time   miss rate              index field
cache hit                    n-way set associative  no-write allocate
page                         least-recently used    write buffer
miss penalty                 tag field              write stall


If this review goes too quickly, you might want to look at Chapter 7 in Computer 
Organization and Design, which we wrote for readers with less experience. 

Cache is the name given to the highest or first level of the memory hierarchy 
encountered once the address leaves the processor. Since the principle of locality 
applies at many levels, and taking advantage of locality to improve performance 
is popular, the term cache is now applied whenever buffering is employed to 
reuse commonly occurring items. Examples include file caches, name caches, 
and so on. 

When the processor finds a requested data item in the cache, it is called a 
cache hit. When the processor does not find a data item it needs in the cache, a 
cache miss occurs. A fixed-size collection of data containing the requested word, 
called a block or line run, is retrieved from the main memory and placed into the 
cache. Temporal locality tells us that we are likely to need this word again in the 
near future, so it is useful to place it in the cache where it can be accessed 
quickly. Because of spatial locality, there is a high probability that the other data 
in the block will be needed soon. 

Level                      1                   2               3                 4
Name                       registers           cache           main memory       disk storage
Typical size               < 1 KB              < 16 MB         < 512 GB          > 1 TB
Implementation technology  custom memory with  on-chip or      CMOS DRAM         magnetic disk
                           multiple ports,     off-chip CMOS
                           CMOS                SRAM
Access time (ns)           0.25-0.5            0.5-25          50-250            5,000,000
Bandwidth (MB/sec)         50,000-500,000      5000-20,000     2500-10,000       50-500
Managed by                 compiler            hardware        operating system  operating system/operator
Backed by                  cache               main memory     disk              CD or tape

Figure C.1 The typical levels in the hierarchy slow down and get larger as we move away from the processor for
a large workstation or small server. Embedded computers might have no disk storage, and much smaller memories
and caches. The access times increase as we move to lower levels of the hierarchy, which makes it feasible to manage
the transfer less responsively. The implementation technology shows the typical technology used for these
functions. The access time is given in nanoseconds for typical values in 2006; these times will decrease over time.
Bandwidth is given in megabytes per second between levels in the memory hierarchy. Bandwidth for disk storage
includes both the media and the buffered interfaces.

The time required for the cache miss depends on both the latency and
bandwidth of the memory. Latency determines the time to retrieve the first word of the
block, and bandwidth determines the time to retrieve the rest of this block. A
cache miss is handled by hardware and causes processors using in-order
execution to pause, or stall, until the data are available. With out-of-order execution, an
instruction using the result must still wait, but other instructions may proceed
during the miss.

Similarly, not all objects referenced by a program need to reside in main 
memory. Virtual memory means some objects may reside on disk. The address 
space is usually broken into fixed-size blocks, called pages. At any time, each 
page resides either in main memory or on disk. When the processor references an 
item within a page that is not present in the cache or main memory, a page fault 
occurs, and the entire page is moved from the disk to main memory. Since page 
faults take so long, they are handled in software and the processor is not stalled. 
The processor usually switches to some other task while the disk access occurs. 
From a high-level perspective, the reliance on locality of references and the rela- 
tive relationships in size and relative cost per bit of cache versus main memory 
are similar to those of main memory versus disk. 

Figure C.1 shows the range of sizes and access times of each level in the
memory hierarchy for computers ranging from high-end desktops to low-end 
servers. 


Cache Performance Review 


Because of locality and the higher speed of smaller memories, a memory hierar- 
chy can substantially improve performance. One method to evaluate cache per- 
formance is to expand our processor execution time equation from Chapter 1. 
We now account for the number of cycles during which the processor is stalled 
waiting for a memory access, which we call the memory stall cycles. The
performance is then the product of the clock cycle time and the sum of the processor
cycles and the memory stall cycles: 


CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time 


This equation assumes that the CPU clock cycles include the time to handle a 
cache hit, and that the processor is stalled during a cache miss. Section C.2 reex- 
amines this simplifying assumption. 

The number of memory stall cycles depends on both the number of misses 
and the cost per miss, which is called the miss penalty: 


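$$
\begin{aligned}
\text{Memory stall cycles} &= \text{Number of misses} \times \text{Miss penalty} \\
&= \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \\
&= \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}
\end{aligned}
$$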

The advantage of the last form is that the components can be easily measured. We
already know how to measure instruction count. (For speculative processors, we 
only count instructions that commit.) Measuring the number of memory refer- 
ences per instruction can be done in the same fashion; every instruction requires 
an instruction access, and it is easy to decide if it also requires a data access. 

Note that we calculated miss penalty as an average, but we will use it below 
as if it were a constant. The memory behind the cache may be busy at the time of 
the miss because of prior memory requests or memory refresh (see Section 5.3). 
The number of clock cycles also varies at interfaces between different clocks of 
the processor, bus, and memory. Thus, please remember that using a single num- 
ber for miss penalty is a simplification. 

The component miss rate is simply the fraction of cache accesses that result 
in a miss (i.e., number of accesses that miss divided by number of accesses). Miss 
rates can be measured with cache simulators that take an address trace of the 
instruction and data references, simulate the cache behavior to determine which 
references hit and which miss, and then report the hit and miss totals. Many 
microprocessors today provide hardware to count the number of misses and 
memory references, which is a much easier and faster way to measure miss rate. 

The formula above is an approximation since the miss rates and miss penal- 
ties are often different for reads and writes. Memory stall clock cycles could then 
be defined in terms of the number of memory accesses per instruction, miss pen- 
alty (in clock cycles) for reads and writes, and miss rate for reads and writes: 


Memory stall clock cycles = IC x Reads per instruction x Read miss rate x Read miss penalty 


+ IC x Writes per instruction x Write miss rate x Write miss penalty 


We normally simplify the complete formula by combining the reads and writes 
and finding the average miss rates and miss penalty for reads and writes: 


$$
\text{Memory stall clock cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}
$$


The miss rate is one of the most important measures of cache design, but, as
we will see in later sections, not the only measure.


Example Assume we have a computer where the clocks per instruction (CPI) is 1.0 when
all memory accesses hit in the cache. The only data accesses are loads and stores,
and these total 50% of the instructions. If the miss penalty is 25 clock cycles and
the miss rate is 2%, how much faster would the computer be if all instructions
were cache hits?


Answer First compute the performance for the computer that always hits:


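$$
\begin{aligned}
\text{CPU execution time} &= (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle} \\
&= (\text{IC} \times \text{CPI} + 0) \times \text{Clock cycle} \\
&= \text{IC} \times 1.0 \times \text{Clock cycle}
\end{aligned}
$$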


Now for the computer with the real cache, first we compute memory stall cycles: 
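$$
\begin{aligned}
\text{Memory stall cycles} &= \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \\
&= \text{IC} \times (1 + 0.5) \times 0.02 \times 25 \\
&= \text{IC} \times 0.75
\end{aligned}
$$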


where the middle term (1 + 0.5) represents one instruction access and 0.5 data
accesses per instruction. The total performance is thus


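$$
\begin{aligned}
\text{CPU execution time}_{\text{cache}} &= (\text{IC} \times 1.0 + \text{IC} \times 0.75) \times \text{Clock cycle} \\
&= 1.75 \times \text{IC} \times \text{Clock cycle}
\end{aligned}
$$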


The performance ratio is the inverse of the execution times: 





$$
\frac{\text{CPU execution time}_{\text{cache}}}{\text{CPU execution time}} = \frac{1.75 \times \text{IC} \times \text{Clock cycle}}{1.0 \times \text{IC} \times \text{Clock cycle}} = 1.75
$$


The computer with no cache misses is 1.75 times faster. 
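
The arithmetic of this example can be checked with a short Python sketch. The function and variable names are illustrative and do not come from the text. Times are expressed per instruction, in clock cycles, so that instruction count and clock cycle time cancel in the ratio.

```python
def cpi_with_stalls(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI = base CPI + memory stall cycles per instruction."""
    stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty
    return base_cpi + stall_cycles_per_instr


# Values from the example: CPI of 1.0, one instruction fetch plus 0.5 data
# accesses per instruction, 2% miss rate, 25-cycle miss penalty.
ideal_cpi = cpi_with_stalls(1.0, 1.5, 0.0, 25)   # cache that always hits
real_cpi = cpi_with_stalls(1.0, 1.5, 0.02, 25)   # cache with 2% misses
speedup = real_cpi / ideal_cpi

print(f"CPI with real cache: {real_cpi:.2f}")
print(f"Speedup of the always-hit computer: {speedup:.2f}")
```

Because both computers execute the same instructions at the same clock rate, the ratio of effective CPIs equals the ratio of execution times.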


Some designers prefer measuring miss rate as misses per instruction rather 
than misses per memory reference. These two are related: 





$$
\frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction count}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}
$$


The latter formula is useful when you know the average number of memory 
accesses per instruction because it allows you to convert miss rate into misses per 
instruction, and vice versa. For example, we can turn the miss rate per memory 
reference in the previous example into misses per instruction: 

$$
\frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} = 0.02 \times 1.5 = 0.030
$$
By the way, misses per instruction are often reported as misses per 1000
instructions to show integers instead of fractions. Thus, the answer above could
also be expressed as 30 misses per 1000 instructions.

The advantage of misses per instruction is that it is independent of the
hardware implementation. For example, speculative processors fetch about twice as
many instructions as are actually committed, which can artificially reduce the
miss rate if measured as misses per memory reference rather than per instruction.
The drawback is that misses per instruction is architecture dependent; for
example, the average number of memory accesses per instruction may be very
different for an 80x86 versus MIPS. Thus, misses per instruction are most popular with
architects working with a single computer family, although the similarity of
RISC architectures allows one to give insights into others.


Example To show equivalency between the two miss rate equations, let's redo the example
above, this time assuming a miss rate per 1000 instructions of 30. What is
memory stall time in terms of instruction count?


Answer Recomputing the memory stall cycles:


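$$
\begin{aligned}
\text{Memory stall cycles} &= \text{Number of misses} \times \text{Miss penalty} \\
&= \text{IC} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \\
&= \frac{\text{IC}}{1000} \times \frac{\text{Misses}}{\text{Instruction} \times 1000} \times \text{Miss penalty} \\
&= \frac{\text{IC}}{1000} \times 30 \times 25 \\
&= \frac{\text{IC}}{1000} \times 750 \\
&= \text{IC} \times 0.75
\end{aligned}
$$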


We get the same answer as in the earlier example, showing equivalence of the two
equations. 
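
The conversion between the two ways of reporting misses, and the equivalence just shown, can be sketched in Python. The function names and the abbreviation `mpki` (misses per 1000 instructions) are shorthand introduced here for illustration, not terms from the text.

```python
def misses_per_instruction(miss_rate, accesses_per_instr):
    """Convert misses per memory reference into misses per instruction."""
    return miss_rate * accesses_per_instr


def stall_cycles_per_instr(mpki, miss_penalty):
    """Memory stall cycles per instruction from misses per 1000 instructions."""
    return mpki / 1000 * miss_penalty


# Values from the running example: 2% miss rate, 1.5 accesses per
# instruction, 25-cycle miss penalty.
mpi = misses_per_instruction(0.02, 1.5)      # misses per instruction
mpki = mpi * 1000                            # misses per 1000 instructions
via_mpki = stall_cycles_per_instr(mpki, 25)  # stall cycles per instruction
via_miss_rate = 1.5 * 0.02 * 25              # same quantity from the miss rate

print(f"{mpki:.0f} misses per 1000 instructions")
print(f"stall cycles per instruction: {via_mpki:.2f} vs {via_miss_rate:.2f}")
```

Multiplying either result by the instruction count gives the memory stall cycles of the two derivations.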


Four Memory Hierarchy Questions 


We continue our introduction to caches by answering the four common questions 
for the first level of the memory hierarchy: 

Q1: Where can a block be placed in the upper level? (block placement)

Q2: How is a block found if it is in the upper level? (block identification) 

Q3: Which block should be replaced on a miss? (block replacement) 

Q4: What happens on a write? (write strategy)

The answers to these questions help us understand the different trade-offs of
memories at different levels of a hierarchy; hence we ask these four questions on
every example.
Q1: Where Can a Block Be Placed in a Cache?


Figure C.2 shows that the restrictions on where a block is placed create three 
categories of cache organization: 


■ If each block has only one place it can appear in the cache, the cache is said to
  be direct mapped. The mapping is usually

      (Block address) MOD (Number of blocks in cache)

■ If a block can be placed anywhere in the cache, the cache is said to be fully
  associative.

■ If a block can be placed in a restricted set of places in the cache, the cache is
  set associative. A set is a group of blocks in the cache. A block is first mapped
  onto a set, and then the block can be placed anywhere within that set. The set
  is usually chosen by bit selection; that is,

      (Block address) MOD (Number of sets in cache)


[Figure C.2 panels, left to right. Fully associative: block 12 can go anywhere.
Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative:
block 12 can go anywhere in set 0 (12 mod 4). The cache is drawn with block
frames 0-7, grouped into sets 0-3 in the set-associative panel; memory is drawn
with block frame addresses 0-31.]


Figure C.2 This example cache has eight block frames and memory has 32 blocks.
The three options for caches are shown left to right. In fully associative, block 12 from
the lower level can go into any of the eight block frames of the cache. With direct
mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative,
which has some of both features, allows the block to be placed anywhere in set 0 (12
modulo 4). With two blocks per set, this means block 12 can be placed either in block 0
or in block 1 of the cache. Real caches contain thousands of block frames and real
memories contain millions of blocks. The set-associative organization has four sets with two
blocks per set, called two-way set associative. Assume that there is nothing in the cache
and that the block address in question identifies lower-level block 12.




If there are n blocks in a set, the cache placement is called n-way set 
associative. 


The range of caches from direct mapped to fully associative is really a continuum 
of levels of set associativity. Direct mapped is simply one-way set associative, 
and a fully associative cache with m blocks could be called "m-way set
associative." Equivalently, direct mapped can be thought of as having m sets, and fully
associative as having one set. 
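
The continuum from direct mapped to fully associative can be illustrated with a short Python sketch that reproduces the block 12 example of Figure C.2. The function name is hypothetical, and the code assumes that set s occupies consecutive block frames starting at s times the associativity, which matches the figure but is not required of real hardware.

```python
def candidate_frames(block_address, num_frames, ways):
    """Return (set index, block frames that may hold block_address).

    ways == 1 is direct mapped; ways == num_frames is fully associative.
    """
    if num_frames % ways != 0:
        raise ValueError("ways must divide the number of block frames")
    num_sets = num_frames // ways
    set_index = block_address % num_sets   # (Block address) MOD (Number of sets)
    first = set_index * ways               # assumed frame layout, see lead-in
    return set_index, list(range(first, first + ways))


for name, ways in [("direct mapped", 1),
                   ("two-way set associative", 2),
                   ("fully associative", 8)]:
    set_index, frames = candidate_frames(12, 8, ways)
    print(f"{name}: set {set_index}, candidate frames {frames}")
```

With eight block frames, the same modulo rule gives frame 4 for direct mapped, frames 0 and 1 for two-way set associative, and all eight frames for fully associative.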

The vast majority of processor caches today are direct mapped, two-way set 
associative, or four-way set associative, for reasons we will see shortly. 


Q2: How Is a Block Found If It Is in the Cache?


Caches have an address tag on each block frame that gives the block address. The 
tag of every cache block that might contain the desired information is checked to 
see if it matches the block address from the processor. As a rule, all possible tags 
are searched in parallel because speed is critical. 

There must be a way to know that a cache block does not have valid informa- 
tion. The most common procedure is to add a valid bit to the tag to say whether or 
not this entry contains a valid address. If the bit is not set, there cannot be a match 
on this address. 

Before proceeding to the next question, let's explore the relationship of a 
processor address to the cache. Figure C.3 shows how an address is divided. 
The first division is between the block address and the block offset. The block 
frame address can be further divided into the tag field and the index field. The 
block offset field selects the desired data from the block, the index field selects 
the set, and the tag field is compared against it for a hit. Although the compari- 
son could be made on more of the address than the tag, there is no need because 
of the following: 


■ The offset should not be used in the comparison, since the entire block is
  present or not, and hence all block offsets result in a match by definition.

■ Checking the index is redundant, since it was used to select the set to be
  checked. An address stored in set 0, for example, must have 0 in the index
  field or it couldn't be stored in set 0; set 1 must have an index value of 1; and
  so on. This optimization saves hardware and power by reducing the width of
  memory size for the cache tag.





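+-----------------------------+--------------+
|        Block address        |    Block     |
+--------------+--------------+    offset    |
|     Tag      |    Index     |              |
+--------------+--------------+--------------+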





Figure C.3 The three portions of an address in a set-associative or direct-mapped 
cache. The tag is used to check all the blocks in the set, and the index is used to select 
the set. The block offset is the address of the desired data within the block. Fully asso- 
ciative caches have no index field. 
If the total cache size is kept the same, increasing associativity increases the
number of blocks per set, thereby decreasing the size of the index and increasing 
the size of the tag. That is, the tag-index boundary in Figure C.3 moves to the 
right with increasing associativity, with the end point of fully associative caches 
having no index field. 
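
A minimal Python sketch of the address division in Figure C.3 follows. It assumes power-of-two sizes and 32-bit addresses, and the 64 KB cache with 64-byte blocks is a hypothetical geometry chosen for illustration. The function names are not from the text.

```python
def address_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a cache geometry.

    Assumes every size is a power of two, so bit_length() - 1 equals log2.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, index, block offset)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# Hypothetical 64 KB cache with 64-byte blocks; 1024-way is fully associative.
for ways in (1, 2, 4, 1024):
    tag_bits, index_bits, offset_bits = address_fields(64 * 1024, 64, ways)
    print(f"{ways:4d}-way: tag {tag_bits}, index {index_bits}, offset {offset_bits}")
```

Each doubling of associativity at fixed cache size moves one bit from the index to the tag, and the fully associative case ends with an index of zero bits.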


Q3: Which Block Should Be Replaced on a Cache Miss? 


When a miss occurs, the cache controller must select a block to be replaced with 
the desired data. A benefit of direct-mapped placement is that hardware decisions 
are simplified—in fact, so simple that there is no choice: Only one block frame is 
checked for a hit, and only that block can be replaced. With fully associative or 
set-associative placement, there are many blocks to choose from on a miss. There 
are three primary strategies employed for selecting which block to replace: 


■ Random—To spread allocation uniformly, candidate blocks are randomly
  selected. Some systems generate pseudorandom block numbers to get
  reproducible behavior, which is particularly useful when debugging hardware.

■ Least-recently used (LRU)—To reduce the chance of throwing out
  information that will be needed soon, accesses to blocks are recorded. Relying on the
  past to predict the future, the block replaced is the one that has been unused
  for the longest time. LRU relies on a corollary of locality: If recently used
  blocks are likely to be used again, then a good candidate for disposal is the
  least-recently used block.

■ First in, first out (FIFO)—Because LRU can be complicated to calculate, this
  approximates LRU by determining the oldest block rather than the LRU.


A virtue of random replacement is that it is simple to build in hardware. As the 
number of blocks to keep track of increases, LRU becomes increasingly 
expensive and is frequently only approximated. Figure C.4 shows the difference 
in miss rates between LRU, random, and FIFO replacement. 
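
The three strategies can be compared on a small scale with the following Python sketch, which models a single set of a set-associative cache. It is a software illustration only, not a description of how hardware tracks recency. The function name, the seed parameter, and the trace of block addresses are all hypothetical.

```python
import random
from collections import OrderedDict


def count_misses(trace, ways, policy, seed=0):
    """Count misses for one set holding `ways` block frames.

    policy is "lru", "fifo", or "random". The OrderedDict keeps blocks in
    eviction order, so its first key is the next victim for LRU and FIFO.
    """
    rng = random.Random(seed)              # seeded, so runs are reproducible
    frames = OrderedDict()
    misses = 0
    for block in trace:
        if block in frames:
            if policy == "lru":
                frames.move_to_end(block)  # a hit refreshes recency
            continue                       # FIFO and random ignore hits
        misses += 1
        if len(frames) == ways:
            if policy == "random":
                victim = rng.choice(list(frames))
            else:
                victim = next(iter(frames))
            del frames[victim]
        frames[block] = True
    return misses


trace = [0, 1, 0, 2, 0, 3, 0, 4]           # hypothetical block addresses
for policy in ("lru", "fifo", "random"):
    print(policy, count_misses(trace, ways=2, policy=policy))
```

On this trace with two ways, LRU keeps the repeatedly used block 0 resident and takes 5 misses, while FIFO evicts it by age and takes 6. The random result depends on the seed.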


Q4: What Happens on a Write?


Reads dominate processor cache accesses. All instruction accesses are reads, and 
most instructions don't write to memory. Figure B.27 in Appendix B suggests a 
mix of 10% stores and 26% loads for MIPS programs, making writes 10%/(100% 
+ 26% + 10%) or about 7% of the overall memory traffic. Of the data cache traf- 
fic, writes are 10%/(26% + 10%) or about 28%. Making the common case fast 
means optimizing caches for reads, especially since processors traditionally wait 
for reads to complete but need not wait for writes. Amdahl's Law (Section 1.9) 
reminds us, however, that high-performance designs cannot neglect the speed of 
writes. 
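
The two percentages above follow from counting memory accesses per 100 instructions, as this small Python sketch shows. The variable names are illustrative.

```python
# Per 100 instructions, the mix in the text gives 100 instruction fetches,
# 26 loads, and 10 stores.
instruction_fetches, loads, stores = 100, 26, 10

write_share_all = stores / (instruction_fetches + loads + stores)
write_share_data = stores / (loads + stores)

print(f"writes as a share of all memory traffic: {write_share_all:.1%}")
print(f"writes as a share of data cache traffic: {write_share_data:.1%}")
```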

                                      Associativity
         Two-way                  Four-way                 Eight-way
Size     LRU    Random  FIFO      LRU    Random  FIFO      LRU    Random  FIFO
16 KB    114.1  117.3   115.5     111.7  115.1   113.3     109.0  111.8   110.4
64 KB    103.4  104.3   103.9     102.4  102.3   103.1     99.7   100.5   100.3
256 KB   92.2   92.1    92.5      92.1   92.1    92.5      92.1   92.1    92.5

Figure C.4 Data cache misses per 1000 instructions comparing least-recently used, random, and first in, first out
replacement for several sizes and associativities. There is little difference between LRU and random for the
largest-size cache, with LRU outperforming the others for smaller caches. FIFO generally outperforms random in the smaller
cache sizes. These data were collected for a block size of 64 bytes for the Alpha architecture using 10 SPEC2000
benchmarks. Five are from SPECint2000 (gap, gcc, gzip, mcf, and perl) and five are from SPECfp2000 (applu, art,
equake, lucas, and swim). We will use this computer and these benchmarks in most figures in this appendix.

Fortunately, the common case is also the easy case to make fast. The block
can be read from the cache at the same time that the tag is read and compared, so
the block read begins as soon as the block address is available. If the read is a hit,
the requested part of the block is passed on to the processor immediately. If it is a
miss, there is no benefit—but also no harm except more power in desktop and
server computers; just ignore the value read.

Such optimism is not allowed for writes. Modifying a block cannot begin 
until the tag is checked to see if the address is a hit. Because tag checking cannot 
occur in parallel, writes normally take longer than reads. Another complexity is 
that the processor also specifies the size of the write, usually between 1 and 8 
bytes; only that portion of a block can be changed. In contrast, reads can access 
more bytes than necessary without fear. 

The write policies often distinguish cache designs. There are two basic 
options when writing to the cache: 


■ Write through—The information is written to both the block in the cache and
  to the block in the lower-level memory.

■ Write back—The information is written only to the block in the cache. The
  modified cache block is written to main memory only when it is replaced.


To reduce the frequency of writing back blocks on replacement, a feature 
called the dirty bit is commonly used. This status bit indicates whether the block 
is dirty (modified while in the cache) or clean (not modified). If it is clean, the 
block is not written back on a miss, since identical information to the cache is 
found in lower levels. 

Both write back and write through have their advantages. With write back, 
writes occur at the speed of the cache memory, and multiple writes within a block 
require only one write to the lower-level memory. Since some writes don't go to 
memory, write back uses less memory bandwidth, making write back attractive in 
multiprocessors. Since write back uses the rest of the memory hierarchy and 
memory interconnect less than write through, it also saves power, making it 
attractive for embedded applications. 
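
The bandwidth argument can be made concrete with a deliberately simplified Python sketch. It assumes that every store hits in the cache and that each dirty block is evicted exactly once, after the last store. The function name and the store stream are hypothetical.

```python
def lower_level_writes(store_blocks, policy):
    """Count writes sent to the lower level for a stream of store hits.

    store_blocks lists the block address touched by each store. All blocks
    are assumed resident, and dirty blocks are written back once at the end.
    """
    if policy == "write-through":
        return len(store_blocks)        # every store reaches the lower level
    if policy == "write-back":
        return len(set(store_blocks))   # one write-back per dirty block
    raise ValueError(f"unknown policy: {policy}")


stores = [7, 7, 7, 9, 7, 9]             # hypothetical block addresses
print("write through:", lower_level_writes(stores, "write-through"))
print("write back:   ", lower_level_writes(stores, "write-back"))
```

The more often a program stores repeatedly into the same block, the larger the gap between the two counts becomes.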


Example 


Answer 
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Write through is easier to implement than write back. The cache is always 
clean, so unlike write back read misses never result in writes to the lower level. 
Write through also has the advantage that the next lower level has the most cur- 
rent copy of the data, which simplifies data coherency. Data coherency is impor- 
tant for multiprocessors and for I/O, which we examine in Chapters 4 and 6. 
Multilevel caches make write through more viable for the upper-level caches, as 
the writes need only propagate to the next lower level rather than all the way to 
main memory. 

As we will see, I/O and multiprocessors are fickle: They want write back for 
processor caches to reduce the memory traffic and write through to keep the 
cache consistent with lower levels of the memory hierarchy. 

When the processor must wait for writes to complete during write through, 
the processor is said to write stall. A common optimization to reduce write stalls 
is a write buffer, which allows the processor to continue as soon as the data are 
written to the buffer, thereby overlapping processor execution with memory 
updating. As we will see shortly, write stalls can occur even with write buffers. 

Since the data are not needed on a write, there are two options on a 
write miss: 


• Write allocate: The block is allocated on a write miss, followed by the write
hit actions above. In this natural option, write misses act like read misses.


• No-write allocate: In this apparently unusual alternative, write misses do not
affect the cache. Instead, the block is modified only in the lower-level memory.


Thus, blocks stay out of the cache in no-write allocate until the program tries to 
read the blocks, but even blocks that are only written will still be in the cache 
with write allocate. Let's look at an example. 


Example


Assume a fully associative write-back cache with many cache entries that starts
empty. Below is a sequence of five memory operations (the address is in square
brackets):


Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100].


What are the number of hits and misses when using no-write allocate versus write 
allocate? 


Answer


For no-write allocate, the address 100 is not in the cache, and there is no
allocation on write, so the first two writes will result in misses. Address 200 is
also not in the cache, so the read is also a miss. The subsequent write to address
200 is a hit. The last write to 100 is still a miss. The result for no-write allocate
is four misses and one hit.

For write allocate, the first accesses to 100 and 200 are misses, and the rest
are hits since 100 and 200 are both found in the cache. Thus, the result for write
allocate is two misses and three hits.
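The hit and miss counts in this example can be checked with a short simulation. The sketch below assumes, as the example does, a fully associative cache large enough that nothing is ever evicted; the function name `count_hits` and the `("R"/"W", address)` encoding of the operations are illustrative choices, not part of the original text.

```python
def count_hits(ops, write_allocate):
    """Return (hits, misses) for a fully associative cache that never evicts."""
    cached = set()              # block addresses currently in the cache
    hits = misses = 0
    for kind, addr in ops:      # kind is "R" (read) or "W" (write)
        if addr in cached:
            hits += 1
            continue
        misses += 1
        # Reads always allocate; writes allocate only under write allocate.
        if kind == "R" or write_allocate:
            cached.add(addr)
    return hits, misses


OPS = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
print(count_hits(OPS, write_allocate=False))   # (1, 4)
print(count_hits(OPS, write_allocate=True))    # (3, 2)
```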


Either write miss policy could be used with write through or write back. Nor- 
mally, write-back caches use write allocate, hoping that subsequent writes to that 
block will be captured by the cache. Write-through caches often use no-write 
allocate. The reasoning is that even if there are subsequent writes to that block, 
the writes must still go to the lower-level memory, so what's to be gained? 


An Example: The Opteron Data Cache


To give substance to these ideas, Figure C.5 shows the organization of the data 
cache in the AMD Opteron microprocessor. The cache contains 65,536 (64K) 
bytes of data in 64-byte blocks with two-way set-associative placement, least- 
recently used replacement, write back, and write allocate on a write miss. 

Let's trace a cache hit through the steps of a hit as labeled in Figure C.5. (The 
four steps are shown as circled numbers.) As described in Section C.5, the 
Opteron presents a 48-bit virtual address to the cache for tag comparison, which 
is simultaneously translated into a 40-bit physical address. 

The reason Opteron doesn't use all 64 bits of virtual address is that its design- 
ers don't think anyone needs that big of a virtual address space yet, and the 
smaller size simplifies the Opteron virtual address mapping. The designers plan 
to grow the virtual address in future microprocessors. 

The physical address coming into the cache is divided into two fields: the 34-bit
block address and the 6-bit block offset (64 = 2^6 and 34 + 6 = 40). The block
address is further divided into an address tag and cache index. Step 1 shows this
division.

The cache index selects the tag to be tested to see if the desired block is in the 
cache. The size of the index depends on cache size, block size, and set associativ- 
ity. For the Opteron cache the set associativity is set to two, and we calculate the 
index as follows: 


2^index = Cache size / (Block size x Set associativity) = 65,536 / (64 x 2) = 512 = 2^9





Hence, the index is 9 bits wide, and the tag is 34 - 9 or 25 bits wide. Although 
that is the index needed to select the proper block, 64 bytes is much more than the 
processor wants to consume at once. Hence, it makes more sense to organize the 
data portion of the cache memory 8 bytes wide, which is the natural data word of 
the 64-bit Opteron processor. Thus, in addition to 9 bits to index the proper cache 
block, 3 more bits from the block offset are used to index the proper 8 bytes. 
Index selection is step 2 in Figure C.5. 
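The field widths above follow mechanically from the cache parameters, as the following sketch shows. The helper names `field_widths` and `split_address` and the sample address are hypothetical; the sketch also assumes that block size and number of sets are powers of two.

```python
def field_widths(cache_bytes, block_bytes, ways, addr_bits):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split a physical address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (index_bits + offset_bits)
    return tag, index, offset


# Opteron data cache: 64 KB, 64-byte blocks, two-way, 40-bit physical address.
tag_bits, index_bits, offset_bits = field_widths(65_536, 64, 2, 40)
print(tag_bits, index_bits, offset_bits)         # 25 9 6

addr = 0x12_3456_789A                            # arbitrary address that fits in 40 bits
tag, index, offset = split_address(addr, index_bits, offset_bits)
word_in_block = offset >> 3                      # top 3 offset bits pick the 8-byte word
print(hex(tag), index, offset, word_in_block)
```

The last step mirrors the text: the 9 index bits choose the set, and 3 more bits taken from the block offset choose one of the eight 8-byte words in the block.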

After the two tags are read from the cache, they are compared to the tag portion
of the block address from the processor. This comparison is step 3 in the figure.
To be sure the tag contains valid information, the valid bit must be set or else
the results of the comparison are ignored.


Figure C.5 The organization of the data cache in the Opteron microprocessor. The
64 KB cache is two-way set associative with 64-byte blocks. The 9-bit index selects
among 512 sets. The four steps of a read hit, shown as circled numbers in order of
occurrence, label this organization. Three bits of the block offset join the index to
supply the RAM address to select the proper 8 bytes. Thus, the cache holds two
groups of 4096 64-bit words, with each group containing half of the 512 sets.
Although not exercised in this example, the line from lower-level memory to the
cache is used on a miss to load the cache. The size of the address leaving the
processor is 40 bits because it is a physical address and not a virtual address.
Figure C.23 on page C-45 explains how the Opteron maps from virtual to physical
for a cache access.

Assuming one tag does match, the final step is to signal the processor to load 
the proper data from the cache by using the winning input from a 2:1 multiplexor. 
The Opteron allows 2 clock cycles for these four steps, so the instructions in the 
following 2 clock cycles would wait if they tried to use the result of the load. 

Handling writes is more complicated than handling reads in the Opteron, as it 
is in any cache. If the word to be written is in the cache, the first three steps are 
the same. Since the Opteron executes out of order, only after it signals that the 
instruction has committed and the cache tag comparison indicates a hit are the 
data written to the cache. 

So far we have assumed the common case of a cache hit. What happens on a 
miss? On a read miss, the cache sends a signal to the processor telling it the data 
are not yet available, and 64 bytes are read from the next level of the hierarchy.
The latency is 7 clock cycles to the first 8 bytes of the block, and then 2 clock 
cycles per 8 bytes for the rest of the block. Since the data cache is set associative, 
there is a choice on which block to replace. Opteron uses LRU, which selects the 
block that was referenced longest ago, so every access must update the LRU bit. 
Replacing a block means updating the data, the address tag, the valid bit, and the 
LRU bit. 
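As a quick check of what these latencies imply for a whole block, the sketch below adds them up. It assumes the remaining 8-byte pieces arrive back to back with no other delay, which the text does not state explicitly; the function name `block_fill_cycles` is illustrative.

```python
def block_fill_cycles(block_bytes=64, chunk_bytes=8,
                      first_chunk_cycles=7, next_chunk_cycles=2):
    """Cycles until the last 8-byte piece of a block arrives on a read miss."""
    chunks = block_bytes // chunk_bytes
    return first_chunk_cycles + next_chunk_cycles * (chunks - 1)


print(block_fill_cycles())   # 21
```

Under that assumption, the first 8 bytes arrive after 7 clock cycles and the full 64-byte block after 7 + 7 x 2 = 21 clock cycles.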

Since the Opteron uses write back, the old data block could have been modi- 
fied, and hence it cannot simply be discarded. The Opteron keeps 1 dirty bit per 
block to record if the block was written. If the "victim" was modified, its data and 
address are sent to the Victim Buffer. (This structure is similar to a write buffer in 
other computers.) The Opteron has space for eight victim blocks. In parallel with 
other cache actions, it writes victim blocks to the next level of the hierarchy. If 
the Victim Buffer is full, the cache must wait. 

A write miss is very similar to a read miss, since the Opteron allocates a 
block on a read or a write miss. 

We have seen how it works, but the data cache cannot supply all the mem- 
ory needs of the processor: The processor also needs instructions. Although a 
single cache could try to supply both, it can be a bottleneck. For example, when 
a load or store instruction is executed, the pipelined processor will simulta- 
neously request both a data word and an instruction word. Hence, a single 
cache would present a structural hazard for loads and stores, leading to stalls. 
One simple way to conquer this problem is to divide it: One cache is dedicated 
to instructions and another to data. Separate caches are found in most recent 
processors, including the Opteron. Hence, it has a 64 KB instruction cache as 
well as the 64 KB data cache. 

The processor knows whether it is issuing an instruction address or a data 
address, so there can be separate ports for both, thereby doubling the bandwidth 
between the memory hierarchy and the processor. Separate caches also offer the 
opportunity of optimizing each cache separately: Different capacities, block 
sizes, and associativities may lead to better performance. (In contrast to the 
instruction caches and data caches of the Opteron, the terms unified or mixed are 
applied to caches that can contain either instructions or data.) 

Figure C.6 shows that instruction caches have lower miss rates than data 
caches. Separating instructions and data removes misses due to conflicts between 
instruction blocks and data blocks, but the split also fixes the cache space devoted 
to each type. Which is more important to miss rates? A fair comparison of sepa- 
rate instruction and data caches to unified caches requires the total cache size to 
be the same. For example, a separate 16 KB instruction cache and 16 KB data 
cache should be compared to a 32 KB unified cache. Calculating the average 
miss rate with separate instruction and data caches necessitates knowing the per- 
centage of memory references to each cache. Figure B.27 on page B-41 suggests 
the split is 100%/(100% + 26% + 10%) or about 74% instruction references to 
(26% + 10%)/(100% + 26% + 10%) or about 26% data references. Splitting 
affects performance beyond what is indicated by the change in miss rates, as we 
will see shortly. 


Size      Instruction cache    Data cache    Unified cache
8 KB      8.16                 44.0          63.0
16 KB     3.82                 40.9          51.0
32 KB     1.36                 38.4          43.3
64 KB     0.61                 36.9          39.4
128 KB    0.30                 35.3          36.2
256 KB    0.02                 32.6          32.9


Figure C.6 Misses per 1000 instructions for instruction, data, and unified caches of
different sizes. The percentage of instruction references is about 74%. The data are
for two-way associative caches with 64-byte blocks for the same computer and
benchmarks as Figure C.4.


C.2 Cache Performance


Because instruction count is independent of the hardware, it is tempting to evaluate 
processor performance using that number. Such indirect performance measures 
have waylaid many a computer designer. The corresponding temptation for evaluat- 
ing memory hierarchy performance is to concentrate on miss rate because it, too, is 
independent of the speed of the hardware. As we will see, miss rate can be just as 
misleading as instruction count. A better measure of memory hierarchy perfor- 
mance is the average memory access time: 


Average memory access time = Hit time + Miss rate X Miss penalty 


where Hit time is the time to hit in the cache; we have seen the other two terms 
before. The components of average access time can be measured either in abso- 
lute time—say, 0.25 to 1.0 nanoseconds on a hit—or in the number of clock 
cycles that the processor waits for the memory—such as a miss penalty of 150 to 
200 clock cycles. Remember that average memory access time is still an indirect 
measure of performance; although it is a better measure than miss rate, it is not a 
substitute for execution time. 
This formula can help us decide between split caches and a unified cache. 


Example


Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data
cache or a 32 KB unified cache? Use the miss rates in Figure C.6 to help calculate
the correct answer, assuming 36% of the instructions are data transfer
instructions. Assume a hit takes 1 clock cycle and the miss penalty is 200 clock
cycles. A load or store hit takes 1 extra clock cycle on a unified cache if there is
only one cache port to satisfy two simultaneous requests. Using the pipelining
terminology of Chapter 2, the unified cache leads to a structural hazard. What is
the average memory access time in each case? Assume write-through caches with a
write buffer and ignore stalls due to the write buffer.


Answer


First let's convert misses per 1000 instructions into miss rates. Solving the
general formula from above, the miss rate is


Miss rate = ((Misses / 1000 instructions) / 1000) / (Memory accesses / Instruction)


Since every instruction access has exactly one memory access to fetch the
instruction, the instruction miss rate is


Miss rate_16 KB instruction = (3.82 / 1000) / 1.00 = 0.004


Since 36% of the instructions are data transfers, the data miss rate is


Miss rate_16 KB data = (40.9 / 1000) / 0.36 = 0.114


The unified miss rate needs to account for instruction and data accesses:


Miss rate_32 KB unified = (43.3 / 1000) / (1.00 + 0.36) = 0.0318


As stated above, about 74% of the memory accesses are instruction references. 
Thus, the overall miss rate for the split caches is 


(74% x 0.004) + (26% x 0.114) = 0.0326 


Thus, a 32 KB unified cache has a slightly lower effective miss rate than two 16 
KB caches. 

The average memory access time formula can be divided into instruction and 
data accesses: 


Average memory access time 
= % instructions x (Hit time + Instruction miss rate x Miss penalty) 
+ % data x (Hit time + Data miss rate x Miss penalty) 


Therefore, the time for each organization is 


Average memory access time_split
= 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200)
= (74% x 1.80) + (26% x 23.80) = 1.332 + 6.188 = 7.52


Average memory access time_unified
= 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200)
= (74% x 7.36) + (26% x 8.36) = 5.446 + 2.174 = 7.62


Hence, the split caches in this example (which offer two memory ports per clock
cycle, thereby avoiding the structural hazard) have a better average memory
access time than the single-ported unified cache despite having a worse effective
miss rate.
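The comparison can be reproduced with a short script. This is a sketch under the example's assumptions: the misses per 1000 instructions come from Figure C.6, the miss penalty is the 200 clock cycles used in the worked arithmetic, and the variable names are illustrative. Unlike the hand calculation, it does not round intermediate values, so its results differ slightly in the second decimal place (about 7.58 and 7.63 rather than 7.52 and 7.62), but the ordering is the same.

```python
HIT_TIME = 1                  # clock cycles
MISS_PENALTY = 200            # clock cycles, as used in the worked arithmetic
DATA_REFS_PER_INSTR = 0.36    # fraction of instructions that are loads or stores


def miss_rate(misses_per_1000_instr, accesses_per_instr):
    """Convert misses per 1000 instructions into misses per memory access."""
    return (misses_per_1000_instr / 1000) / accesses_per_instr


instr_mr = miss_rate(3.82, 1.00)                           # 16 KB instruction cache
data_mr = miss_rate(40.9, DATA_REFS_PER_INSTR)             # 16 KB data cache
unified_mr = miss_rate(43.3, 1.00 + DATA_REFS_PER_INSTR)   # 32 KB unified cache

# Fractions of all memory accesses that are instruction fetches and data accesses.
frac_instr = 1.00 / (1.00 + DATA_REFS_PER_INSTR)
frac_data = 1.0 - frac_instr

split_mr = frac_instr * instr_mr + frac_data * data_mr

amat_split = (frac_instr * (HIT_TIME + instr_mr * MISS_PENALTY)
              + frac_data * (HIT_TIME + data_mr * MISS_PENALTY))
# A load or store pays one extra cycle on the single-ported unified cache.
amat_unified = (frac_instr * (HIT_TIME + unified_mr * MISS_PENALTY)
                + frac_data * (HIT_TIME + 1 + unified_mr * MISS_PENALTY))

print(f"miss rate: split {split_mr:.4f}, unified {unified_mr:.4f}")
print(f"AMAT:      split {amat_split:.2f}, unified {amat_unified:.2f}")
```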


Average Memory Access Time and Processor Performance 


An obvious question is whether average memory access time due to cache misses 
predicts processor performance. 

First, there are other reasons for stalls, such as contention due to I/O devices 
using memory. Designers often assume that all memory stalls are due to cache 
misses, since the memory hierarchy typically dominates other reasons for stalls. 
We use this simplifying assumption here, but be sure to account for all memory
stalls when calculating final performance.

Second, the answer depends also on the processor. If we have an in-order exe- 
cution processor (see Chapter 2), then the answer is basically yes. The processor 
stalls during misses, and the memory stall time is strongly correlated to average 
memory access time. Let's make that assumption for now, but we'll return to out- 
of-order processors in the next subsection. 

As stated in the previous section, we can model CPU time as 


CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time 


This formula raises the question of whether the clock cycles for a cache hit 
should be considered part of CPU execution clock cycles or part of memory stall 
clock cycles. Although either convention is defensible, the most widely accepted 
is to include hit clock cycles in CPU execution clock cycles. 

We can now explore the impact of caches on performance. 


Example


Let's use an in-order execution computer for the first example. Assume the cache
miss penalty is 200 clock cycles, and all instructions normally take 1.0 clock
cycles (ignoring memory stalls). Assume the average miss rate is 2%, there is an
average of 1.5 memory references per instruction, and the average number of
cache misses per 1000 instructions is 30. What is the impact on performance
when behavior of the cache is included? Calculate the impact using both misses
per instruction and miss rate.

Answer


CPU time = IC x (CPI_execution + Memory stall clock cycles / Instruction) x Clock cycle time


The performance, including cache misses, is


CPU time_with cache = IC x (1.0 + (30/1000 x 200)) x Clock cycle time
= IC x 7.00 x Clock cycle time


Now calculating performance using miss rate:


CPU time = IC x (CPI_execution + Miss rate x Memory accesses / Instruction x Miss penalty) x Clock cycle time

CPU time_with cache = IC x (1.0 + (1.5 x 2% x 200)) x Clock cycle time
= IC x 7.00 x Clock cycle time


The clock cycle time and instruction count are the same, with or without a 
cache. Thus, CPU time increases sevenfold, with CPI from 1.00 for a "perfect 
cache" to 7.00 with a cache that can miss. Without any memory hierarchy at all 
the CPI would increase again to 1.0 + 200 x 1.5 or 301—a factor of more than 40 
times longer than a system with a cache! 
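Both routes to the same CPI can be written as two small functions. The sketch below uses the example's numbers, taking 1.5 memory references per instruction as in the arithmetic above; the function names are illustrative.

```python
def cpi_from_misses_per_instr(cpi_execution, misses_per_1000_instr, miss_penalty):
    """CPI including memory stalls, starting from misses per 1000 instructions."""
    return cpi_execution + (misses_per_1000_instr / 1000) * miss_penalty


def cpi_from_miss_rate(cpi_execution, miss_rate, accesses_per_instr, miss_penalty):
    """CPI including memory stalls, starting from the miss rate per access."""
    return cpi_execution + miss_rate * accesses_per_instr * miss_penalty


cpi_a = cpi_from_misses_per_instr(1.0, 30, 200)
cpi_b = cpi_from_miss_rate(1.0, 0.02, 1.5, 200)
# With no cache at all, every memory access pays the full penalty (miss rate 100%).
cpi_no_cache = cpi_from_miss_rate(1.0, 1.0, 1.5, 200)

print(round(cpi_a, 2), round(cpi_b, 2), round(cpi_no_cache, 2))   # 7.0 7.0 301.0
print(round(cpi_no_cache / cpi_a, 1))                             # 43.0
```

The two formulations agree because 30 misses per 1000 instructions is the same quantity as a 2% miss rate times 1.5 accesses per instruction.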


As this example illustrates, cache behavior can have enormous impact on per- 
formance. Furthermore, cache misses have a double-barreled impact on a proces- 
sor with a low CPI and a fast clock: 


1. The lower the CPI_execution, the higher the relative impact of a fixed number of
cache miss clock cycles.


2. When calculating CPI, the cache miss penalty is measured in processor clock 
cycles for a miss. Therefore, even if memory hierarchies for two computers 
are identical, the processor with the higher clock rate has a larger number of 
clock cycles per miss and hence a higher memory portion of CPI. 
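The second point is easy to see numerically. In the sketch below every number is an illustrative assumption rather than a figure from the text: two processors share a memory system with a 65 ns miss latency and incur 30 misses per 1000 instructions, and one runs with a clock cycle half as long as the other.

```python
def miss_penalty_cycles(memory_latency_ns, clock_cycle_ns):
    """Miss penalty expressed in processor clock cycles."""
    return memory_latency_ns / clock_cycle_ns


def memory_cpi(misses_per_instr, memory_latency_ns, clock_cycle_ns):
    """Memory stall cycles per instruction (the memory portion of CPI)."""
    return misses_per_instr * miss_penalty_cycles(memory_latency_ns, clock_cycle_ns)


LATENCY_NS = 65.0           # assumed identical memory hierarchy for both processors
MISSES_PER_INSTR = 0.030    # assumed

for cycle_ns in (0.70, 0.35):   # slower clock, then faster clock
    cycles = miss_penalty_cycles(LATENCY_NS, cycle_ns)
    stall = memory_cpi(MISSES_PER_INSTR, LATENCY_NS, cycle_ns)
    print(f"cycle {cycle_ns:.2f} ns: penalty {cycles:.1f} cycles, memory CPI {stall:.2f}")
```

Halving the clock cycle time doubles the miss penalty measured in cycles, and with it the memory portion of CPI, even though the memory itself is unchanged.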


The importance of the cache for processors with low CPI and high clock rates is 
thus greater, and, consequently, greater is the danger of neglecting cache 
behavior in assessing performance of such computers. Amdahl's Law strikes 
again! 

Although minimizing average memory access time is a reasonable goal— 
and we will use it in much of this appendix—keep in mind that the final goal is 
to reduce processor execution time. The next example shows how these two can 
differ. 


Example


What is the impact of two different cache organizations on the performance of a
processor? Assume that the CPI with a perfect cache is 1.6, the clock cycle time
is 0.35 ns, there are 1.4 memory references per instruction, the size of both
caches is 128 KB, and both have a block size of 64 bytes. One cache is direct
mapped and the other is two-way set associative. Figure C.5 shows that for
set-associative caches we must add a multiplexor to select between the blocks in
the set depending on the tag match. Since the speed of the processor can be tied
directly to the speed of a cache hit, assume the processor clock cycle time must
be stretched 1.35 times to accommodate the selection multiplexor of the
set-associative cache. To the first approximation, the cache miss penalty is 65 ns
for either cache organization. (In practice, it is normally rounded up or down to
an integer number of clock cycles.) First, calculate the average memory access
time and then processor performance. Assume the hit time is 1 clock cycle, the
miss rate of a direct-mapped 128 KB cache is 2.1%, and the miss rate for a two-way
set-associative cache of the same size is 1.9%.


Answer


Average memory access time is 
Average memory access time = Hit time + Miss rate x Miss penalty 
Thus, the time for each organization is 


Average memory access time_1-way = 0.35 + (0.021 x 65) = 1.72 ns
Average memory access time_2-way = 0.35 x 1.35 + (0.019 x 65) = 1.71 ns


The average memory access time is better for the two-way set-associative cache. 
The processor performance is 


CPU time = IC x (CPI_execution + Misses / Instruction x Miss penalty) x Clock cycle time

= IC x [(CPI_execution x Clock cycle time)
+ (Miss rate x Memory accesses / Instruction x Miss penalty x Clock cycle time)]

Substituting 65 ns for (Miss penalty x Clock cycle time), the performance of each 
cache organization is 


CPU time_1-way = IC x (1.6 x 0.35 + (0.021 x 1.4 x 65)) = 2.47 x IC
CPU time_2-way = IC x (1.6 x 0.35 x 1.35 + (0.019 x 1.4 x 65)) = 2.49 x IC


and relative performance is


CPU time_2-way / CPU time_1-way = (2.49 x Instruction count) / (2.47 x Instruction count) = 2.49 / 2.47 = 1.01


In contrast to the results of average memory access time comparison, the direct- 
mapped cache leads to slightly better average performance because the clock 
cycle is stretched for all instructions for the two-way set-associative case, even if 
there are fewer misses. Since CPU time is our bottom-line evaluation, and since 
direct mapped is simpler to build, the preferred cache is direct mapped in this 
example. 
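The reversal between the two metrics can be reproduced directly. The sketch below uses the example's parameters (1.4 memory references per instruction, as in the arithmetic above) and reports time per instruction in nanoseconds; the function and constant names are illustrative.

```python
CPI_EXECUTION = 1.6
CYCLE_NS = 0.35
STRETCH = 1.35            # clock stretch needed by the two-way cache
REFS_PER_INSTR = 1.4
MISS_PENALTY_NS = 65.0


def amat_ns(hit_ns, miss_rate, miss_penalty_ns):
    """Average memory access time in nanoseconds."""
    return hit_ns + miss_rate * miss_penalty_ns


def cpu_time_per_instr_ns(cpi_execution, cycle_ns, miss_rate,
                          refs_per_instr, miss_penalty_ns):
    """CPU time divided by instruction count, in nanoseconds."""
    return cpi_execution * cycle_ns + miss_rate * refs_per_instr * miss_penalty_ns


amat_1way = amat_ns(CYCLE_NS, 0.021, MISS_PENALTY_NS)
amat_2way = amat_ns(CYCLE_NS * STRETCH, 0.019, MISS_PENALTY_NS)
cpu_1way = cpu_time_per_instr_ns(CPI_EXECUTION, CYCLE_NS, 0.021,
                                 REFS_PER_INSTR, MISS_PENALTY_NS)
cpu_2way = cpu_time_per_instr_ns(CPI_EXECUTION, CYCLE_NS * STRETCH, 0.019,
                                 REFS_PER_INSTR, MISS_PENALTY_NS)

print(f"AMAT: 1-way {amat_1way:.3f} ns, 2-way {amat_2way:.3f} ns")
print(f"CPU time per instruction: 1-way {cpu_1way:.3f} ns, 2-way {cpu_2way:.3f} ns")
```

The two-way cache wins on average memory access time, yet the direct-mapped cache wins on CPU time because the stretched clock is paid on every instruction, not only on memory accesses.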


Miss Penalty and Out-of-Order Execution Processors 


For an out-of-order execution processor, how do you define "miss penalty"? Is it 
the full latency of the miss to memory, or is it just the "exposed" or nonover- 
lapped latency when the processor must stall? This question does not arise in pro- 
cessors that stall until the data miss completes. 
Let's redefine memory stalls to lead to a new definition of miss penalty as
nonoverlapped latency:


Memory stall cycles / Instruction = Misses / Instruction x (Total miss latency - Overlapped miss latency)

Similarly, as some out-of-order processors stretch the hit time, that portion of the
performance equation could be expressed as total hit latency less overlapped hit
latency. This equation could be further expanded to account for contention for
memory resources in an out-of-order processor by dividing total miss latency into 
latency without contention and latency due to contention. Let's just concentrate 
on miss latency. 

We now have to decide the following: 


• Length of memory latency: What to consider as the start and the end of a
memory operation in an out-of-order processor


• Length of latency overlap: What is the start of overlap with the processor (or,
equivalently, when do we say a memory operation is stalling the processor)


Given the complexity of out-of-order execution processors, there is no single cor- 
rect definition. 

Since only committed operations are seen at the retirement pipeline stage, we 
say a processor is stalled in a clock cycle if it does not retire the maximum possi- 
ble number of instructions in that cycle. We attribute that stall to the first instruc- 
tion that could not be retired. This definition is by no means foolproof. For 
example, applying an optimization to improve a certain stall time may not always 
improve execution time because another type of stall—hidden behind the targeted 
stall—may now be exposed. 

For latency, we could start measuring from the time the memory instruction is 
queued in the instruction window, or when the address is generated, or when the 
instruction is actually sent to the memory system. Any option works as long as it 
is used in a consistent fashion. 


Example


Let's redo the example above, but this time we assume the processor with the
longer clock cycle time supports out-of-order execution yet still has a
direct-mapped cache. Assume 30% of the 65 ns miss penalty can be overlapped; that
is, the average CPU memory stall time is now 45.5 ns.


Answer


Average memory access time for the out-of-order (OOO) computer is

Average memory access time_1-way,OOO = 0.35 x 1.35 + (0.021 x 45.5) = 1.43 ns

The performance of the OOO cache is

CPU time_1-way,OOO = IC x (1.6 x 0.35 x 1.35 + (0.021 x 1.4 x 45.5)) = 2.09 x IC


Hence, despite a much slower clock cycle time and the higher miss rate of a
direct-mapped cache, the out-of-order computer can be slightly faster if it can 
hide 30% of the miss penalty. 
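The same calculation with a partly hidden miss penalty looks as follows. This sketch reuses the example's numbers; the function name `cpu_time_per_instr_ns` and the `overlap_fraction` parameter are illustrative, and treating the overlap as a fixed fraction of every miss is the example's simplification, not a general model of out-of-order execution.

```python
def exposed_penalty_ns(miss_penalty_ns, overlap_fraction):
    """Nonoverlapped part of the miss latency, which is what stalls the processor."""
    return miss_penalty_ns * (1.0 - overlap_fraction)


def cpu_time_per_instr_ns(cpi_execution, cycle_ns, miss_rate,
                          refs_per_instr, stall_ns_per_miss):
    """CPU time divided by instruction count, in nanoseconds."""
    return cpi_execution * cycle_ns + miss_rate * refs_per_instr * stall_ns_per_miss


stall_ns = exposed_penalty_ns(65.0, 0.30)            # 45.5 ns
slow_cycle_ns = 0.35 * 1.35                          # the longer clock cycle

amat_ooo = slow_cycle_ns + 0.021 * stall_ns
cpu_ooo = cpu_time_per_instr_ns(1.6, slow_cycle_ns, 0.021, 1.4, stall_ns)
cpu_in_order = cpu_time_per_instr_ns(1.6, 0.35, 0.021, 1.4, 65.0)

print(f"OOO AMAT {amat_ooo:.3f} ns")
print(f"CPU time per instruction: OOO {cpu_ooo:.3f} ns, in-order {cpu_in_order:.3f} ns")
```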


In summary, although the state of the art in defining and measuring memory 
stalls for out-of-order processors is complex, be aware of the issues because they 
significantly affect performance. The complexity arises because out-of-order pro- 
cessors tolerate some latency due to cache misses without hurting performance. 
Consequently, designers normally use simulators of the out-of-order processor 
and memory when evaluating trade-offs in the memory hierarchy to be sure that 
an improvement that helps the average memory latency actually helps program 
performance. 

To help summarize this section and to act as a handy reference, Figure C.7 
lists the cache equations in this appendix. 


2^index = Cache size / (Block size x Set associativity)

CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

Memory stall cycles = Number of misses x Miss penalty

Memory stall cycles = IC x Misses / Instruction x Miss penalty

Misses / Instruction = Miss rate x Memory accesses / Instruction

Average memory access time = Hit time + Miss rate x Miss penalty

CPU execution time = IC x (CPI_execution + Memory stall clock cycles / Instruction) x Clock cycle time

CPU execution time = IC x (CPI_execution + Misses / Instruction x Miss penalty) x Clock cycle time

CPU execution time = IC x (CPI_execution + Miss rate x Memory accesses / Instruction x Miss penalty) x Clock cycle time

Memory stall cycles / Instruction = Misses / Instruction x (Total miss latency - Overlapped miss latency)

Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)

Memory stall cycles / Instruction = Misses_L1 / Instruction x Hit time_L2 + Misses_L2 / Instruction x Miss penalty_L2



Figure C.7 Summary of performance equations in this appendix. The first equation calculates the cache index 
size, and the rest help evaluate performance. The final two equations deal with multilevel caches, which are 
explained early in the next section. They are included here to help make the figure a useful reference. 
C.3 Six Basic Cache Optimizations


The average memory access time formula gave us a framework to present cache 
optimizations for improving cache performance: 


Average memory access time = Hit time + Miss rate x Miss penalty 


Hence, we organize six cache optimizations into three categories: 


• Reducing the miss rate: larger block size, larger cache size, and higher
associativity

• Reducing the miss penalty: multilevel caches and giving reads priority over
writes


• Reducing the time to hit in the cache: avoiding address translation when
indexing the cache


Figure C.17 on page C-39 concludes this section with a summary of the imple- 
mentation complexity and the performance benefits of these six techniques. 

The classical approach to improving cache behavior is to reduce miss rates, and 
we present three techniques to do so. To gain better insights into the causes of 
misses, we first start with a model that sorts all misses into three simple categories: 


• Compulsory: The very first access to a block cannot be in the cache, so the
block must be brought into the cache. These are also called cold-start misses
or first-reference misses.


• Capacity: If the cache cannot contain all the blocks needed during execution
of a program, capacity misses (in addition to compulsory misses) will occur
because of blocks being discarded and later retrieved.


• Conflict: If the block placement strategy is set associative or direct mapped,
conflict misses (in addition to compulsory and capacity misses) will occur
because a block may be discarded and later retrieved if too many blocks map
to its set. These misses are also called collision misses. The idea is that hits in
a fully associative cache that become misses in an n-way set-associative
cache are due to more than n requests on some popular sets.


(Chapter 4 adds a fourth C, for Coherency misses due to cache flushes to keep 
multiple caches coherent in a multiprocessor; we won't consider those here.) 

Figure C.8 shows the relative frequency of cache misses, broken down by 
the "three C's." Compulsory misses are those that occur in an infinite cache. 
Capacity misses are those that occur in a fully associative cache. Conflict misses 
are those that occur going from fully associative to eight-way associative, four- 
way associative, and so on. Figure C.9 presents the same data graphically. The 
top graph shows absolute miss rates; the bottom graph plots the percentage of all 
the misses by type of miss as a function of cache size. 
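The definitions above translate into a simple classification procedure: simulate the cache under study alongside a fully associative cache of the same capacity, and label each miss by where it would also have missed. The sketch below is one way to do this, assuming LRU replacement in both caches and a trace of block addresses; the function name `classify_misses` and the tiny traces are illustrative and unrelated to the SPEC2000 measurements in the figures.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks, ways):
    """Classify the misses of a set-associative LRU cache by the three C's.

    trace      -- sequence of block addresses
    num_blocks -- cache capacity in blocks
    ways       -- associativity (the cache has num_blocks // ways sets)
    """
    num_sets = num_blocks // ways
    seen = set()                                       # blocks referenced so far
    full = OrderedDict()                               # fully associative reference cache
    sets = [OrderedDict() for _ in range(num_sets)]    # the cache under study
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Reference cache: fully associative, same capacity, LRU.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)               # evict least recently used
            full[block] = True

        # Cache under study: LRU within the selected set.
        lines = sets[block % num_sets]
        real_hit = block in lines
        if real_hit:
            lines.move_to_end(block)
        else:
            if len(lines) == ways:
                lines.popitem(last=False)
            lines[block] = True

        if real_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1                  # first reference ever
        elif not full_hit:
            counts["capacity"] += 1                    # misses even when fully associative
        else:
            counts["conflict"] += 1                    # misses only because of placement
        seen.add(block)
    return counts


# Blocks 0 and 4 collide in a 4-block direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=1))
# Three blocks cycling through a 2-block fully associative cache.
print(classify_misses([0, 1, 2, 0], num_blocks=2, ways=2))
```

In the first trace the two repeat references are conflict misses, since a fully associative cache of four blocks would have held both blocks. In the second, the final reference is a capacity miss, since even full associativity cannot keep three blocks in two entries.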


                                        Miss rate components (relative percent)
                                        (sum = 100% of total miss rate)
Cache size  Degree       Total miss
(KB)        associative  rate        Compulsory       Capacity         Conflict
4           1-way        0.098       0.0001  0.1%     0.070   72%      0.027  28%
4           2-way        0.076       0.0001  0.1%     0.070   93%      0.005   7%
4           4-way        0.071       0.0001  0.1%     0.070   99%      0.001   1%
4           8-way        0.071       0.0001  0.1%     0.070  100%      0.000   0%
8           1-way        0.068       0.0001  0.1%     0.044   65%      0.024  35%
8           2-way        0.049       0.0001  0.1%     0.044   90%      0.005  10%
8           4-way        0.044       0.0001  0.1%     0.044   99%      0.000   1%
8           8-way        0.044       0.0001  0.1%     0.044  100%      0.000   0%
16          1-way        0.049       0.0001  0.1%     0.040   82%      0.009  17%
16          2-way        0.041       0.0001  0.2%     0.040   98%      0.001   2%
16          4-way        0.041       0.0001  0.2%     0.040   99%      0.000   0%
16          8-way        0.041       0.0001  0.2%     0.040  100%      0.000   0%
32          1-way        0.042       0.0001  0.2%     0.037   89%      0.005  11%
32          2-way        0.038       0.0001  0.2%     0.037   99%      0.000   0%
32          4-way        0.037       0.0001  0.2%     0.037  100%      0.000   0%
32          8-way        0.037       0.0001  0.2%     0.037  100%      0.000   0%
64          1-way        0.037       0.0001  0.2%     0.028   77%      0.008  23%
64          2-way        0.031       0.0001  0.2%     0.028   91%      0.003   9%
64          4-way        0.030       0.0001  0.2%     0.028   95%      0.001   4%
64          8-way        0.029       0.0001  0.2%     0.028   97%      0.001   2%
128         1-way        0.021       0.0001  0.3%     0.019   91%      0.002   8%
128         2-way        0.019       0.0001  0.3%     0.019  100%      0.000   0%
128         4-way        0.019       0.0001  0.3%     0.019  100%      0.000   0%
128         8-way        0.019       0.0001  0.3%     0.019  100%      0.000   0%
256         1-way        0.013       0.0001  0.5%     0.012   94%      0.001   6%
256         2-way        0.012       0.0001  0.5%     0.012   99%      0.000   0%
256         4-way        0.012       0.0001  0.5%     0.012   99%      0.000   0%
256         8-way        0.012       0.0001  0.5%     0.012   99%      0.000   0%
512         1-way        0.008       0.0001  0.8%     0.005   66%      0.003  33%
512         2-way        0.007       0.0001  0.9%     0.005   71%      0.002  28%
512         4-way        0.006       0.0001  1.1%     0.005   91%      0.000   8%
512         8-way        0.006       0.0001  1.1%     0.005   95%      0.000   4%





Figure C.8 Total miss rate for each size cache and percentage of each according to the "three C's." Compulsory 
misses are independent of cache size, while capacity misses decrease as capacity increases, and conflict misses 
decrease as associativity increases. Figure C.9 shows the same information graphically. Note that a direct-mapped 
cache of size N has about the same miss rate as a two-way set-associative cache of size N/2 up through 128 KB. Caches
larger than 128 KB do not prove that rule. Note that the Capacity column is also the fully associative miss rate. Data 
were collected as in Figure C.4 using LRU replacement. 
Figure C.9 Total miss rate (top) and distribution of miss rate (bottom) for each size
cache according to the three C's for the data in Figure C.8. The top diagram is the
actual data cache miss rates, while the bottom diagram shows the percentage in each
category. (Space allows the graphs to show one more cache size than can fit in
Figure C.8.)


To show the benefit of associativity, conflict misses are divided into misses 
caused by each decrease in associativity. Here are the four divisions of conflict 
misses and how they are calculated: 


• Eight-way: Conflict misses due to going from fully associative (no conflicts) 
to eight-way associative 


• Four-way: Conflict misses due to going from eight-way associative to 
four-way associative 


• Two-way: Conflict misses due to going from four-way associative to 
two-way associative 


• One-way: Conflict misses due to going from two-way associative to 
one-way associative (direct mapped) 
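
The four divisions above are successive differences between the miss rates measured at neighboring associativities. A minimal sketch of that bookkeeping follows; the miss rates are made-up placeholders chosen for illustration, not values from Figure C.8, and the function name is hypothetical.

```python
# Hypothetical miss rates for one cache size (illustrative, NOT from Figure C.8).
MISS_RATE = {
    "fully": 0.0050,   # fully associative: no conflict misses by definition
    "8-way": 0.0052,
    "4-way": 0.0055,
    "2-way": 0.0066,
    "1-way": 0.0080,   # direct mapped
}

ORDER = ["fully", "8-way", "4-way", "2-way", "1-way"]

def conflict_breakdown(miss_rate):
    """Conflict misses added by each step down in associativity."""
    steps = {}
    for more, less in zip(ORDER, ORDER[1:]):
        # Extra misses caused by going from `more` to `less` associativity.
        steps[less] = round(miss_rate[less] - miss_rate[more], 4)
    return steps

breakdown = conflict_breakdown(MISS_RATE)
for name, extra in breakdown.items():
    print(f"{name}: {extra:.4f}")
```

Summing the four steps gives the total conflict miss rate of the direct-mapped cache, that is, its miss rate minus the fully associative miss rate.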


As we can see from the figures, the compulsory miss rate of the SPEC2000 
programs is very small, as it is for many long-running programs. 

Having identified the three C's, what can a computer designer do about them? 
Conceptually, conflicts are the easiest: Fully associative placement avoids all 
conflict misses. Full associativity is expensive in hardware, however, and may 
slow the processor clock rate (see the example on page C-28), leading to lower 
overall performance. 

There is little to be done about capacity except to enlarge the cache. If the 
upper-level memory is much smaller than what is needed for a program, and a 
significant percentage of the time is spent moving data between two levels in the 
hierarchy, the memory hierarchy is said to thrash. Because so many replacements 
are required, thrashing means the computer runs close to the speed of the lower- 
level memory, or maybe even slower because of the miss overhead. 

Another approach to improving the three C's is to make blocks larger to 
reduce the number of compulsory misses, but, as we will see shortly, large blocks 
can increase other kinds of misses. 

The three C's give insight into the cause of misses, but this simple model 
has its limits; it gives you insight into average behavior but may not explain an 
individual miss. For example, changing cache size changes conflict misses as 
well as capacity misses, since a larger cache spreads out references to more 
blocks. Thus, a miss might move from a capacity miss to a conflict miss as 
cache size changes. Note that the three C's also ignore replacement policy, 
since it is difficult to model and since, in general, it is less significant. In spe- 
cific circumstances the replacement policy can actually lead to anomalous 
behavior, such as poorer miss rates for larger associativity, which contradicts 
the three C's model. (Some have proposed using an address trace to determine 
optimal placement in memory to avoid placement misses from the three C's 
model; we've not followed that advice here.) 

Alas, many of the techniques that reduce miss rates also increase hit time or 
miss penalty. The desirability of reducing miss rates using the three optimizations 
must be balanced against the goal of making the whole system fast. This first 
example shows the importance of a balanced perspective. 


First Optimization: Larger Block Size to Reduce Miss Rate 


The simplest way to reduce miss rate is to increase the block size. Figure C.10 
shows the trade-off of block size versus miss rate for a set of programs and cache 
sizes. Larger block sizes will also reduce compulsory misses. This reduction 
occurs because the principle of locality has two components: temporal locality 
and spatial locality. Larger blocks take advantage of spatial locality. 


Figure C.10 Miss rate versus block size for five different-sized caches. Note that miss 
rate actually goes up if the block size is too large relative to the cache size. Each line rep- 
resents a cache of different size. Figure C.11 shows the data used to plot these lines. 
Unfortunately, SPEC2000 traces would take too long if block size were included, so 
these data are based on SPEC92 on a DECstation 5000 [Gee et al. 1993]. 


At the same time, larger blocks increase the miss penalty. Since they reduce 
the number of blocks in the cache, larger blocks may increase conflict misses and 
even capacity misses if the cache is small. Clearly, there is little reason to 
increase the block size to such a size that it increases the miss rate. There is also 
no benefit to reducing miss rate if it increases the average memory access time. 
The increase in miss penalty may outweigh the decrease in miss rate. 


Example Figure C.11 shows the actual miss rates plotted in Figure C.10. Assume the 
memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 
clock cycles. Thus, it can supply 16 bytes in 82 clock cycles, 32 bytes in 84 clock 
cycles, and so on. Which block size has the smallest average memory access time 
for each cache size in Figure C.11? 


Answer Average memory access time is 


Average memory access time = Hit time + Miss rate x Miss penalty 


If we assume the hit time is 1 clock cycle independent of block size, then the 
access time for a 16-byte block in a 4 KB cache is 


Average memory access time = 1 + (8.57% × 82) = 8.027 clock cycles 

and for a 256-byte block in a 256 KB cache the average memory access time is 

Average memory access time = 1 + (0.49% × 112) = 1.549 clock cycles 
                         Cache size 
Block size      4K       16K       64K      256K 
    16        8.57%     3.94%     2.04%     1.09% 
    32        7.24%     2.87%     1.35%     0.70% 
    64        7.00%     2.64%     1.06%     0.51% 
   128        7.18%     2.77%     1.02%     0.49% 
   256        9.51%     3.29%     1.15%     0.49% 





Figure C.11 Actual miss rate versus block size for five different-sized caches in 
Figure C.10. Note that for a 4 KB cache, 256-byte blocks have a higher miss rate than 
32-byte blocks. In this example, the cache would have to be 256 KB in order for a 256- 
byte block to decrease misses. 























Cache size 
Block size Miss penalty 4K 16K 64K 256K 
16 82 8.027 4.231 2.673 1.894 
32 84 7.082 3.411 2.134 1.588 
64 88 7.160 3.323 1.933 1.449 
128 96 8.469 3.659 1.979 1.470 
256 112 11.651 4.685 2.288 1.549 





Figure C.12 Average memory access time versus block size for five different-sized 
caches in Figure C.10. Block sizes of 32 and 64 bytes dominate. The smallest average 
time per cache size is boldfaced. 


Figure C.12 shows the average memory access time for all block and cache sizes 
between those two extremes. The boldfaced entries show the fastest block size 
for a given cache size: 32 bytes for 4 KB and 64 bytes for the larger caches. 
These sizes are, in fact, popular block sizes for processor caches today. 
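
The calculation behind Figure C.12 can be reproduced in a few lines. The sketch below uses the miss rates of Figure C.11 and the memory model of the example (80 cycles of overhead plus 2 cycles per 16 bytes); the function and variable names are illustrative assumptions, not part of the original text.

```python
# Miss rates (as fractions) from Figure C.11, keyed by block size in bytes.
MISS_RATE = {
    "4K":   {16: 0.0857, 32: 0.0724, 64: 0.0700, 128: 0.0718, 256: 0.0951},
    "16K":  {16: 0.0394, 32: 0.0287, 64: 0.0264, 128: 0.0277, 256: 0.0329},
    "64K":  {16: 0.0204, 32: 0.0135, 64: 0.0106, 128: 0.0102, 256: 0.0115},
    "256K": {16: 0.0109, 32: 0.0070, 64: 0.0051, 128: 0.0049, 256: 0.0049},
}

HIT_TIME = 1  # clock cycles, assumed independent of block size

def miss_penalty(block_bytes):
    """80 cycles of overhead, then 16 bytes delivered every 2 cycles."""
    return 80 + 2 * (block_bytes // 16)

def amat(miss_rate, block_bytes):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return HIT_TIME + miss_rate * miss_penalty(block_bytes)

def best_block_size(cache):
    """Block size with the smallest average memory access time."""
    rates = MISS_RATE[cache]
    return min(rates, key=lambda block: amat(rates[block], block))

for cache in MISS_RATE:
    best = best_block_size(cache)
    print(cache, best, round(amat(MISS_RATE[cache][best], best), 3))
```

Replacing `amat` with the raw miss rate in `best_block_size` would pick larger blocks for the bigger caches, which is the difference between minimizing miss rate and minimizing average memory access time.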


As in all of these techniques, the cache designer is trying to minimize both 
the miss rate and the miss penalty. The selection of block size depends on both 
the latency and bandwidth of the lower-level memory. High latency and high 
bandwidth encourage large block size since the cache gets many more bytes per 
miss for a small increase in miss penalty. Conversely, low latency and low band- 
width encourage smaller block sizes since there is little time saved from a larger 
block. For example, twice the miss penalty of a small block may be close to the 
penalty of a block twice the size. The larger number of small blocks may also 
reduce conflict misses. Note that Figures C.10 and C.12 show the difference 
between selecting a block size based on minimizing miss rate versus minimizing 
average memory access time. 

After seeing the positive and negative impact of larger block size on compul- 
sory and capacity misses, the next two subsections look at the potential of higher 
capacity and higher associativity. 


Second Optimization: Larger Caches to Reduce Miss Rate 


The obvious way to reduce capacity misses in Figures C.8 and C.9 is to increase 
capacity of the cache. The obvious drawback is potentially longer hit time and 
higher cost and power. This technique has been especially popular in off-chip 
caches. 


Third Optimization: Higher Associativity to Reduce Miss Rate 


Figures C.8 and C.9 show how miss rates improve with higher associativity. 
There are two general rules of thumb that can be gleaned from these figures. The 
first is that eight-way set associative is for practical purposes as effective in 
reducing misses for these sized caches as fully associative. You can see the differ- 
ence by comparing the eight-way entries to the capacity miss column in Figure 
C.8, since capacity misses are calculated using fully associative caches. 

The second observation, called the 2:1 cache rule of thumb, is that a 
direct-mapped cache of size N has about the same miss rate as a two-way 
set-associative cache of size N/2. This held in three C's figures for cache sizes 
less than 128 KB. 

Like many of these examples, improving one aspect of the average memory 
access time comes at the expense of another. Increasing block size reduces miss 
rate while increasing miss penalty, and greater associativity can come at the cost 
of increased hit time. Hence, the pressure of a fast processor clock cycle encour- 
ages simple cache designs, but the increasing miss penalty rewards associativity, 
as the following example suggests. 


Example Assume higher associativity would increase the clock cycle time as listed below: 


Clock cycle time_{2-way} = 1.36 × Clock cycle time_{1-way} 
Clock cycle time_{4-way} = 1.44 × Clock cycle time_{1-way} 
Clock cycle time_{8-way} = 1.52 × Clock cycle time_{1-way} 


Assume that the hit time is 1 clock cycle, that the miss penalty for the direct- 
mapped case is 25 clock cycles to a level 2 cache (see next subsection) that never 
misses, and that the miss penalty need not be rounded to an integral number of 
clock cycles. Using Figure C.8 for miss rates, for which cache sizes are each of 
these three statements true? 


Average memory access time_{8-way} < Average memory access time_{4-way} 
Average memory access time_{4-way} < Average memory access time_{2-way} 
Average memory access time_{2-way} < Average memory access time_{1-way} 
                               Associativity 
Cache size (KB)    One-way    Two-way    Four-way    Eight-way 
        4           3.44       3.25        3.22        3.28 
        8           2.69       2.58        2.55        2.62 
       16           2.23       2.40        2.46        2.53 
       32           2.06       2.30        2.37        2.45 
       64           1.92       2.14        2.18        2.25 
      128           1.52       1.84        1.92        2.00 
      256           1.32       1.66        1.74        1.82 
      512           1.20       1.55        1.59        1.66 





Figure C.13 Average memory access time using miss rates in Figure C.8 for parame- 
ters in the example. Boldface type means that this time is higher than the number to 
the left; that is, higher associativity increases average memory access time. 


Answer Average memory access time for each associativity is 


Average memory access time_{8-way} = Hit time_{8-way} + Miss rate_{8-way} × Miss penalty_{8-way} 
                                   = 1.52 + Miss rate_{8-way} × 25 
Average memory access time_{4-way} = 1.44 + Miss rate_{4-way} × 25 
Average memory access time_{2-way} = 1.36 + Miss rate_{2-way} × 25 
Average memory access time_{1-way} = 1.00 + Miss rate_{1-way} × 25 


The miss penalty is the same time in each case, so we leave it as 25 clock cycles. 
For example, the average memory access time for a 4 KB direct-mapped cache is 


Average memory access time_{1-way} = 1.00 + (0.098 × 25) = 3.44 

and the time for a 512 KB, eight-way set-associative cache is 

Average memory access time_{8-way} = 1.52 + (0.006 × 25) = 1.66 


Using these formulas and the miss rates from Figure C.8, Figure C.13 shows the 
average memory access time for each cache and associativity. The figure shows 
that the formulas in this example hold for caches less than or equal to 8 KB for up 
to four-way associativity. Starting with 16 KB, the greater hit time of larger asso- 
ciativity outweighs the time saved due to the reduction in misses. 

Note that we did not account for the slower clock rate on the rest of the program 
in this example, thereby understating the advantage of direct-mapped cache. 
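
As a rough check of this answer, the sketch below evaluates the four formulas for the 256 KB rows of Figure C.8. The hit-time multipliers and the 25-cycle miss penalty come from the example; the dictionary and function names are illustrative assumptions.

```python
MISS_PENALTY = 25  # clock cycles to an L2 cache that never misses

# Hit time in units of the direct-mapped clock cycle, per the example.
HIT_TIME = {1: 1.00, 2: 1.36, 4: 1.44, 8: 1.52}

# Total miss rates for the 256 KB cache as listed in Figure C.8.
MISS_RATE_256KB = {1: 0.013, 2: 0.012, 4: 0.012, 8: 0.012}

def amat(ways, miss_rate):
    """Average memory access time for a given associativity."""
    return HIT_TIME[ways] + miss_rate * MISS_PENALTY

times = {ways: amat(ways, rate) for ways, rate in MISS_RATE_256KB.items()}
fastest = min(times, key=times.get)

for ways, t in times.items():
    print(f"{ways}-way: {t:.3f}")
print("fastest:", f"{fastest}-way")
```

For this cache size the direct-mapped organization comes out fastest, in line with the statement that from 16 KB upward the longer hit time outweighs the reduction in misses.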


Fourth Optimization: Multilevel Caches to Reduce Miss Penalty 


Reducing cache misses had been the traditional focus of cache research, but the 
cache performance formula assures us that improvements in miss penalty can be 
just as beneficial as improvements in miss rate. Moreover, Figure 5.2 on page 289 
shows that technology trends have improved the speed of processors faster than 
DRAMs, making the relative cost of miss penalties increase over time. 

This performance gap between processors and memory leads the architect to 
this question: Should I make the cache faster to keep pace with the speed of pro- 
cessors, or make the cache larger to overcome the widening gap between the pro- 
cessor and main memory? 

One answer is, do both. Adding another level of cache between the original 
cache and memory simplifies the decision. The first-level cache can be small 
enough to match the clock cycle time of the fast processor. Yet the second-level 
cache can be large enough to capture many accesses that would go to main mem- 
ory, thereby lessening the effective miss penalty. 

Although the concept of adding another level in the hierarchy is straightfor- 
ward, it complicates performance analysis. Definitions for a second level of 
cache are not always straightforward. Let's start with the definition of average 
memory access time for a two-level cache. Using the subscripts L1 and L2 to 
refer, respectively, to a first-level and a second-level cache, the original formula is 


Average memory access time = Hit time_{L1} + Miss rate_{L1} × Miss penalty_{L1} 

and 

Miss penalty_{L1} = Hit time_{L2} + Miss rate_{L2} × Miss penalty_{L2} 

so 

Average memory access time = Hit time_{L1} + Miss rate_{L1} 
                             × (Hit time_{L2} + Miss rate_{L2} × Miss penalty_{L2}) 


In this formula, the second-level miss rate is measured on the leftovers from the 
first-level cache. To avoid ambiguity, these terms are adopted here for a two-level 
cache system: 


• Local miss rate: This rate is simply the number of misses in a cache divided 
by the total number of memory accesses to this cache. As you would expect, 
for the first-level cache it is equal to Miss rate_{L1}, and for the second-level 
cache it is Miss rate_{L2}. 


• Global miss rate: The number of misses in the cache divided by the total 
number of memory accesses generated by the processor. Using the terms 
above, the global miss rate for the first-level cache is still just Miss rate_{L1}, but 
for the second-level cache it is Miss rate_{L1} × Miss rate_{L2}. 


This local miss rate is large for second-level caches because the first-level 
cache skims the cream of the memory accesses. This is why the global miss rate 
is the more useful measure: It indicates what fraction of the memory accesses 
that leave the processor go all the way to memory. 

Here is a place where the misses per instruction metric shines. Instead of con- 
fusion about local or global miss rates, we just expand memory stalls per instruc- 
tion to add the impact of a second-level cache. 
Average memory stalls per instruction = Misses per instruction_{L1} × Hit time_{L2} 
                                        + Misses per instruction_{L2} × Miss penalty_{L2} 


Example Suppose that in 1000 memory references there are 40 misses in the first-level 
cache and 20 misses in the second-level cache. What are the various miss rates? 
Assume the miss penalty from the L2 cache to memory is 200 clock cycles, the 
hit time of the L2 cache is 10 clock cycles, the hit time of L1 is 1 clock cycle, and 
there are 1.5 memory references per instruction. What is the average memory 
access time and average stall cycles per instruction? Ignore the impact of writes. 


Answer The miss rate (either local or global) for the first-level cache is 40/1000 or 4%. 
The local miss rate for the second-level cache is 20/40 or 50%. The global miss 
rate of the second-level cache is 20/1000 or 2%. Then 


Average memory access time = Hit time_{L1} + Miss rate_{L1} × (Hit time_{L2} + Miss rate_{L2} × Miss penalty_{L2}) 
                           = 1 + 4% × (10 + 50% × 200) = 1 + 4% × 110 = 5.4 clock cycles 


To see how many misses we get per instruction, we divide 1000 memory 
references by 1.5 memory references per instruction, which yields 667 instructions. 
Thus, we need to multiply the misses by 1.5 to get the number of misses per 1000 
instructions. We have 40 × 1.5 or 60 L1 misses, and 20 × 1.5 or 30 L2 misses, per 
1000 instructions. For average memory stalls per instruction, assuming the 
misses are distributed uniformly between instructions and data: 


Average memory stalls per instruction = Misses per instruction_{L1} × Hit time_{L2} 
                                        + Misses per instruction_{L2} × Miss penalty_{L2} 
                                      = (60/1000) × 10 + (30/1000) × 200 
                                      = 0.060 × 10 + 0.030 × 200 = 6.6 clock cycles 


If we subtract the L1 hit time from the average memory access time (AMAT) and 
then multiply by the average number of memory references per instruction, we get 
the same average memory stalls per instruction: 


(5.4 - 1.0) × 1.5 = 4.4 × 1.5 = 6.6 clock cycles 


As this example shows, there may be less confusion with multilevel caches when 
calculating using misses per instruction versus miss rates. 
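
The arithmetic of this example can be written out as a short script. It is only a sketch: the function and parameter names are assumptions made for illustration, and the inputs are the figures of the example (1000 references, 40 L1 misses, 20 L2 misses, hit times of 1 and 10 cycles, a 200-cycle L2 miss penalty, and 1.5 memory references per instruction).

```python
def two_level_stats(refs, l1_misses, l2_misses,
                    hit_l1, hit_l2, penalty_l2, refs_per_instr):
    """Miss rates, AMAT, and memory stalls per instruction for two cache levels."""
    l1_miss_rate = l1_misses / refs      # local and global rates coincide for L1
    l2_local = l2_misses / l1_misses     # misses / accesses that reach L2
    l2_global = l2_misses / refs         # misses / all processor accesses
    amat = hit_l1 + l1_miss_rate * (hit_l2 + l2_local * penalty_l2)
    # Misses per instruction = misses per reference x references per instruction.
    stalls = (l1_miss_rate * refs_per_instr * hit_l2
              + l2_global * refs_per_instr * penalty_l2)
    return l1_miss_rate, l2_local, l2_global, amat, stalls

l1_rate, l2_local, l2_global, amat, stalls = two_level_stats(
    refs=1000, l1_misses=40, l2_misses=20,
    hit_l1=1, hit_l2=10, penalty_l2=200, refs_per_instr=1.5)

print(l1_rate, l2_local, l2_global)
print(round(amat, 2), round(stalls, 2))
# Cross-check: (AMAT - L1 hit time) x references per instruction.
print(round((amat - 1) * 1.5, 2))
```

Note that the global L2 miss rate equals the product of the L1 miss rate and the local L2 miss rate, which is why the two routes to the stall count agree.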


Note that these formulas are for combined reads and writes, assuming a write- 
back first-level cache. Obviously, a write-through first-level cache will send all 
writes to the second level, not just the misses, and a write buffer might be used. 

Figures C.14 and C.15 show how miss rates and relative execution time change 
with the size of a second-level cache for one design. From these figures we can gain 
two insights. The first is that the global cache miss rate is very similar to the single 
cache miss rate of the second-level cache, provided that the second-level cache is 
much larger than the first-level cache. Hence, our intuition and knowledge about 
the first-level caches apply. The second insight is that the local cache miss rate is 
not a good measure of secondary caches; it is a function of the miss rate of the 
first-level cache, and hence can vary by changing the first-level cache. Thus, the 
global cache miss rate should be used when evaluating second-level caches. 

Figure C.14 Miss rates versus cache size for multilevel caches. Second-level caches 
smaller than the sum of the two 64 KB first-level caches make little sense, as reflected in 
the high miss rates. After 256 KB the single cache is within 10% of the global miss rates. 
The miss rate of a single-level cache versus size is plotted against the local miss rate and 
global miss rate of a second-level cache using a 32 KB first-level cache. The L2 caches 
(unified) were two-way set associative with LRU replacement. Each had split L1 instruction 
and data caches that were 64 KB two-way set associative with LRU replacement. The block 
size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure C.4. 

Figure C.15 Relative execution time by second-level cache size. The two bars are for 
different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 
8192 KB second-level cache with a 1-clock-cycle latency on a second-level hit. These 
data were collected the same way as in Figure C.14, using a simulator to imitate the 
Alpha 21264. 

With these definitions in place, we can consider the parameters of second- 
level caches. The foremost difference between the two levels is that the speed of 
the first-level cache affects the clock rate of the processor, while the speed of the 
second-level cache only affects the miss penalty of the first-level cache. Thus, we 
can consider many alternatives in the second-level cache that would be ill chosen 
for the first-level cache. There are two major questions for the design of the 
second-level cache: Will it lower the average memory access time portion of the 
CPI, and how much does it cost? 

The initial decision is the size of a second-level cache. Since everything in the 
first-level cache is likely to be in the second-level cache, the second-level cache 
should be much bigger than the first. If second-level caches are just a little bigger, 
the local miss rate will be high. This observation inspires the design of huge 
second-level caches—the size of main memory in older computers! 

One question is whether set associativity makes more sense for second-level 
caches. 


Example Given the data below, what is the impact of second-level cache associativity on 
its miss penalty? 

• Hit time_{L2} for direct mapped = 10 clock cycles. 

• Two-way set associativity increases hit time by 0.1 clock cycles to 10.1 clock 
cycles. 

• Local miss rate_{L2} for direct mapped = 25%. 

• Local miss rate_{L2} for two-way set associative = 20%. 

• Miss penalty_{L2} = 200 clock cycles. 


Answer For a direct-mapped second-level cache, the first-level cache miss penalty is 


Miss penalty_{1-way L2} = 10 + 25% × 200 = 60.0 clock cycles 


Adding the cost of associativity increases the hit cost only 0.1 clock cycles, 
making the new first-level cache miss penalty 


Miss penalty_{2-way L2} = 10.1 + 20% × 200 = 50.1 clock cycles 


In reality, second-level caches are almost always synchronized with the first-level 
cache and processor. Accordingly, the second-level hit time must be an integral 
number of clock cycles. If we are lucky, we shave the second-level hit time to 
10 cycles; if not, we round up to 11 cycles. Either choice is an improvement over 
the direct-mapped second-level cache: 


Miss penalty_{2-way L2} = 10 + 20% × 200 = 50.0 clock cycles 

Miss penalty_{2-way L2} = 11 + 20% × 200 = 51.0 clock cycles 


Now we can reduce the miss penalty by reducing the miss rate of the second- 
level caches. 
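
The comparison in this example reduces to one formula applied with different hit times and local miss rates. The following sketch restates it; the function name is hypothetical, and rounding the hit time up with `math.ceil` stands in for the "unlucky" case of an integral number of clock cycles.

```python
import math

L2_MISS_PENALTY = 200  # clock cycles from the L2 cache to memory

def l1_miss_penalty(l2_hit_time, l2_local_miss_rate):
    """First-level miss penalty = L2 hit time + L2 local miss rate x L2 miss penalty."""
    return l2_hit_time + l2_local_miss_rate * L2_MISS_PENALTY

direct_mapped = l1_miss_penalty(10, 0.25)
two_way_ideal = l1_miss_penalty(10.1, 0.20)
# A synchronized L2 needs a whole number of cycles: 10.1 rounds up to 11.
two_way_rounded = l1_miss_penalty(math.ceil(10.1), 0.20)

print(direct_mapped, round(two_way_ideal, 1), two_way_rounded)
```

Even with the hit time rounded up, the two-way organization keeps most of its advantage, because the drop in local miss rate is multiplied by the large L2 miss penalty.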

Another consideration concerns whether data in the first-level cache is in the 
second-level cache. Multilevel inclusion is the natural policy for memory 
hierarchies: L1 data are always present in L2. Inclusion is desirable because 
consistency between I/O and caches (or among caches in a multiprocessor) can be 
determined just by checking the second-level cache. 

One drawback to inclusion is that measurements can suggest smaller blocks 
for the smaller first-level cache and larger blocks for the larger second-level 
cache. For example, the Pentium 4 has 64-byte blocks in its L1 caches and 128-byte 
blocks in its L2 cache. Inclusion can still be maintained with more work on 
a second-level miss. The second-level cache must invalidate all first-level blocks 
that map onto the second-level block to be replaced, causing a slightly higher 
first-level miss rate. To avoid such problems, many cache designers keep the 
block size the same in all levels of caches. 

However, what if the designer can only afford an L2 cache that is slightly 
bigger than the L1 cache? Should a significant portion of its space be used as a 
redundant copy of the L1 cache? In such cases a sensible opposite policy is 
multilevel exclusion: L1 data is never found in an L2 cache. Typically, with exclusion 
a cache miss in L1 results in a swap of blocks between L1 and L2 instead of a 
replacement of an L1 block with an L2 block. This policy prevents wasting space 
in the L2 cache. For example, the AMD Opteron chip obeys the exclusion 
property using two 64 KB L1 caches and 1 MB L2 cache. 

As these issues illustrate, although a novice might design the first- and 
second-level caches independently, the designer of the first-level cache has a sim- 
pler job given a compatible second-level cache. It is less of a gamble to use a 
write through, for example, if there is a write-back cache at the next level to act 
as a backstop for repeated writes and it uses multilevel inclusion. 

The essence of all cache designs is balancing fast hits and few misses. For 
second-level caches, there are many fewer hits than in the first-level cache, so the 
emphasis shifts to fewer misses. This insight leads to much larger caches and 
techniques to lower the miss rate, such as higher associativity and larger blocks. 


Fifth Optimization: Giving Priority to Read Misses over Writes 
to Reduce Miss Penalty 


This optimization serves reads before writes have been completed. We start with 
looking at the complexities of a write buffer. 


With a write-through cache the most important improvement is a write buffer 
of the proper size. Write buffers, however, do complicate memory accesses 
because they might hold the updated value of a location needed on a read miss. 


Example Look at this code sequence: 


SW R3, 512(R0)     ;M[512] ← R3    (cache index 0) 
LW R1, 1024(R0)    ;R1 ← M[1024]   (cache index 0) 
LW R2, 512(R0)     ;R2 ← M[512]    (cache index 0) 


Assume a direct-mapped, write-through cache that maps 512 and 1024 to the 
same block, and a four-word write buffer that is not checked on a read miss. Will 
the value in R2 always be equal to the value in R3? 


Answer Using the terminology from Chapter 2, this is a read-after-write data hazard in 
memory. Let's follow a cache access to see the danger. The data in R3 are placed 
into the write buffer after the store. The following load uses the same cache index 
and is therefore a miss. The second load instruction tries to put the value in loca- 
tion 512 into register R2; this also results in a miss. If the write buffer hasn't 
completed writing to location 512 in memory, the read of location 512 will put 
the old, wrong value into the cache block, and then into R2. Without proper pre- 
cautions, R3 would not be equal to R2! 


The simplest way out of this dilemma is for the read miss to wait until the 
write buffer is empty. The alternative is to check the contents of the write buffer 
on a read miss, and if there are no conflicts and the memory system is available, 
let the read miss continue. Virtually all desktop and server processors use the 
latter approach, giving reads priority over writes. 
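
The hazard and its fix can be mimicked with a toy model. The sketch below is a deliberate simplification under stated assumptions: the cache holds a single block (so addresses 512 and 1024 trivially collide), stores do not allocate on a miss, and a conflict found in the write buffer is resolved by forwarding the buffered value, which is one possible policy that the text does not prescribe. All class and function names are hypothetical.

```python
class WriteThroughCache:
    """One-block, direct-mapped, write-through cache with a write buffer."""

    def __init__(self, memory, check_buffer_on_miss):
        self.memory = memory            # backing store: address -> value
        self.block = None               # (address, value) currently cached
        self.write_buffer = []          # pending (address, value) writes
        self.check_buffer = check_buffer_on_miss

    def store(self, addr, value):
        # Update the cached copy on a hit; the write always enters the buffer.
        if self.block and self.block[0] == addr:
            self.block = (addr, value)
        self.write_buffer.append((addr, value))

    def load(self, addr):
        if self.block and self.block[0] == addr:
            return self.block[1]        # cache hit
        value = self.memory[addr]       # read miss fetches from memory
        if self.check_buffer:
            # The newest matching buffered write overrides the stale copy.
            for buffered_addr, buffered_value in self.write_buffer:
                if buffered_addr == addr:
                    value = buffered_value
        self.block = (addr, value)
        return value

def run_sequence(check_buffer_on_miss):
    """SW R3,512 ; LW R1,1024 ; LW R2,512 with the buffer never drained."""
    cache = WriteThroughCache({512: 0, 1024: 7}, check_buffer_on_miss)
    r3 = 42
    cache.store(512, r3)
    cache.load(1024)                    # evicts whatever maps to index 0
    return cache.load(512)              # value that ends up in R2

print(run_sequence(False), run_sequence(True))
```

With the buffer unchecked, R2 receives the old memory value rather than the value stored from R3; with the check enabled, R2 matches R3.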

The cost of writes by the processor in a write-back cache can also be reduced. 
Suppose a read miss will replace a dirty memory block. Instead of writing the 
dirty block to memory, and then reading memory, we could copy the dirty block 
to a buffer, then read memory, and then write memory. This way the processor 
read, for which the processor is probably waiting, will finish sooner. Similar to 
the previous situation, if a read miss occurs, the processor can either stall until the 
buffer is empty or check the addresses of the words in the buffer for conflicts. 

Now that we have five optimizations that reduce cache miss penalties or miss 
rates, it is time to look at reducing the final component of average memory access 
time. Hit time is critical because it can affect the clock rate of the processor; in 
many processors today the cache access time limits the clock cycle rate, even for 
processors that take multiple clock cycles to access the cache. Hence, a fast hit 
time is multiplied in importance beyond the average memory access time formula 
because it helps everything. 
Sixth Optimization: Avoiding Address Translation during 
Indexing of the Cache to Reduce Hit Time 


Even a small and simple cache must cope with the translation of a virtual address 
from the processor to a physical address to access memory. As described in Sec- 
tion C.4, processors treat main memory as just another level of the memory hier- 
archy, and thus the address of the virtual memory that exists on disk must be 
mapped onto the main memory. 

The guideline of making the common case fast suggests that we use virtual 
addresses for the cache, since hits are much more common than misses. Such 
caches are termed virtual caches, with physical cache used to identify the tradi- 
tional cache that uses physical addresses. As we will shortly see, it is important to 
distinguish two tasks: indexing the cache and comparing addresses. Thus, the 
issues are whether a virtual or physical address is used to index the cache and 
whether a virtual or physical address is used in the tag comparison. Full virtual 
addressing for both indices and tags eliminates address translation time from a 
cache hit. Then why doesn't everyone build virtually addressed caches? 

One reason is protection. Page-level protection is checked as part of the vir- 
tual to physical address translation, and it must be enforced no matter what. One 
solution is to copy the protection information from the TLB on a miss, add a field 
to hold it, and check it on every access to the virtually addressed cache. 

Another reason is that every time a process is switched, the virtual addresses 
refer to different physical addresses, requiring the cache to be flushed. 
Figure C.16 shows the impact on miss rates of this flushing. One solution is to 
increase the width of the cache address tag with a process-identifier tag (PID). If 
the operating system assigns these tags to processes, it only need flush the cache 
when a PID is recycled; that is, the PID distinguishes whether or not the data in 
the cache are for this program. Figure C.16 shows the improvement in miss rates 
by using PIDs to avoid cache flushes. 
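
A toy model makes the effect of PIDs concrete. The sketch below is an illustration under assumptions, not a reproduction of the measurements in Figure C.16: it uses a direct-mapped cache of 8 blocks, two processes that touch disjoint sets of virtual blocks, and a full purge on every switch when PIDs are not used. All names are hypothetical.

```python
class VirtualCache:
    """Direct-mapped, virtually addressed cache that counts misses."""

    def __init__(self, num_blocks, use_pid):
        self.tags = [None] * num_blocks
        self.use_pid = use_pid
        self.current_pid = None
        self.misses = 0

    def switch_to(self, pid):
        if not self.use_pid and pid != self.current_pid:
            # Without PIDs a virtual address may name different data in
            # another process, so the whole cache is purged on a switch.
            self.tags = [None] * len(self.tags)
        self.current_pid = pid

    def access(self, vblock):
        index = vblock % len(self.tags)
        # With PIDs the tag is widened by the process identifier.
        tag = (self.current_pid, vblock) if self.use_pid else vblock
        if self.tags[index] != tag:
            self.misses += 1
            self.tags[index] = tag

def count_misses(use_pid, rounds=3):
    cache = VirtualCache(num_blocks=8, use_pid=use_pid)
    working_set = {1: range(0, 4), 2: range(4, 8)}  # disjoint virtual blocks
    for _ in range(rounds):
        for pid in (1, 2):
            cache.switch_to(pid)
            for vblock in working_set[pid]:
                cache.access(vblock)
    return cache.misses

print(count_misses(use_pid=True), count_misses(use_pid=False))
```

In this contrived workload the PID-tagged cache misses only on first touch, while the purging cache misses on every block after every switch.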

A third reason why virtual caches are not more popular is that operating sys- 
tems and user programs may use two different virtual addresses for the same 
physical address. These duplicate addresses, called synonyms or aliases, could 
result in two copies of the same data in a virtual cache; if one is modified, the 
other will have the wrong value. With a physical cache this wouldn't happen, 
since the accesses would first be translated to the same physical cache block. 

Hardware solutions to the synonym problem, called antialiasing, guarantee 
every cache block a unique physical address. The Opteron uses a 64 KB 
instruction cache with a 4 KB page and two-way set associativity; hence, the hardware 
must handle aliases involved with the three virtual address bits in the set index. It 
avoids aliases by simply checking all eight possible locations on a miss (two 
blocks in each of four sets) to be sure that none match the physical address of 
the data being fetched. If one is found, it is invalidated, so when the new data are 
loaded into the cache their physical address is guaranteed to be unique. 

[Bar chart not reproduced: miss rate versus cache size, with bars for Purge, PIDs, 
and Uniprocess.] 

Figure C.16 Miss rate versus virtually addressed cache size of a program measured 
three ways: without process switches (uniprocess), with process switches using a 
process-identifier tag (PID), and with process switches but without PIDs (purge). 
PIDs increase the uniprocess absolute miss rate by 0.3% to 0.6% and save 0.6% to 4.3% 
over purging. Agarwal [1987] collected these statistics for the Ultrix operating system 
running on a VAX, assuming direct-mapped caches with a block size of 16 bytes. Note 
that the miss rate goes up from 128K to 256K. Such nonintuitive behavior can occur in 
caches because changing size changes the mapping of memory blocks onto cache 
blocks, which can change the conflict miss rate. 


Software can make this problem much easier by forcing aliases to share some 
address bits. An older version of UNIX from Sun Microsystems, for example, 
required all aliases to be identical in the last 18 bits of their addresses; this 
restriction is called page coloring. Note that page coloring is simply 
set-associative mapping applied to virtual memory: The 4 KB (2^12) pages are mapped using 
64 (2^6) sets to ensure that the physical and virtual addresses match in the last 18 
bits. This restriction means a direct-mapped cache that is 2^18 (256K) bytes or 
smaller can never have duplicate physical addresses for blocks. From the 
perspective of the cache, page coloring effectively increases the page offset, as 
software guarantees that the last few bits of the virtual and physical page address are 
identical. 

The final area of concern with virtual addresses is I/O. I/O typically uses 
physical addresses and thus would require mapping to virtual addresses to inter- 
act with a virtual cache. (The impact of I/O on caches is further discussed in 
Chapter 6.) 
One alternative to get the best of both virtual and physical caches is to use 
part of the page offset—the part that is identical in both virtual and physical 
addresses—to index the cache. At the same time as the cache is being read using 
that index, the virtual part of the address is translated, and the tag match uses 
physical addresses. 

This alternative allows the cache read to begin immediately, and yet the tag 
comparison is still with physical addresses. The limitation of this virtually 
indexed, physically tagged alternative is that a direct-mapped cache can be no 
bigger than the page size. For example, in the data cache in Figure C.5 on page 
C-13, the index is 9 bits and the cache block offset is 6 bits. To use this trick, the 
virtual page size would have to be at least $2^{9+6}$ bytes or 32 KB. If not, a portion
of the index must be translated from virtual to physical address. 
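
The page-size limit can be checked with a little arithmetic. The following Python sketch is only an illustration (the function names are hypothetical, not from the text): it computes the smallest page size whose page offset covers both the cache index and the block offset, using the Figure C.5 numbers quoted above.

```python
def min_page_size_for_virtual_index(index_bits: int, block_offset_bits: int) -> int:
    """Smallest page size (bytes) whose page offset covers index + block offset."""
    return 1 << (index_bits + block_offset_bits)


def index_needs_translation(index_bits: int, block_offset_bits: int, page_size: int) -> bool:
    """True if some index bits lie above the page offset and must be translated."""
    return min_page_size_for_virtual_index(index_bits, block_offset_bits) > page_size


# Data cache of Figure C.5: 9 index bits and 6 block-offset bits.
print(min_page_size_for_virtual_index(9, 6))          # 32768 bytes = 32 KB
print(index_needs_translation(9, 6, page_size=4096))  # True: 4 KB pages are too small
```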

Associativity can keep the index in the physical part of the address and yet 
still support a large cache. Recall that the size of the index is controlled by this 
formula: 

$$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$





For example, doubling associativity and doubling the cache size does not change 
the size of the index. The IBM 3033 cache, as an extreme example, is 16-way set 
associative, even though studies show there is little benefit to miss rates above 8- 
way set associativity. This high associativity allows a 64 KB cache to be 
addressed with a physical index, despite the handicap of 4 KB pages in the IBM 
architecture. 
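
As a rough illustration of the index formula, the Python sketch below computes the number of index bits and checks whether a cache can be indexed entirely from the page offset. The helper names and the 64-byte block size are assumptions made for the example; the text does not give the block size of the IBM 3033.

```python
def index_bits(cache_size: int, block_size: int, associativity: int) -> int:
    """Index width: log2(cache size / (block size * set associativity))."""
    sets = cache_size // (block_size * associativity)
    if sets < 1 or sets & (sets - 1):
        raise ValueError("number of sets must be a power of two")
    return sets.bit_length() - 1


def physically_indexable(cache_size: int, associativity: int, page_size: int) -> bool:
    """True if index + block offset fit inside the page offset.

    Index and block offset together address cache_size / associativity bytes,
    so this check does not depend on the block size.
    """
    return cache_size // associativity <= page_size


KB = 1024
BLOCK = 64  # assumed block size, for illustration only

# Doubling both associativity and cache size leaves the index unchanged.
print(index_bits(32 * KB, BLOCK, 2), index_bits(64 * KB, BLOCK, 4))  # 8 8

# IBM 3033 example: 64 KB cache, 16-way set associative, 4 KB pages.
print(physically_indexable(64 * KB, 16, 4 * KB))  # True
print(physically_indexable(64 * KB, 8, 4 * KB))   # False
```

With 16 ways, each way holds 4 KB, which is exactly the page size, so no index bit needs translation.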


Summary of Basic Cache Optimization 


The techniques in this section to improve miss rate, miss penalty, and hit time 
generally impact the other components of the average memory access equation as 
well as the complexity of the memory hierarchy. Figure C.17 summarizes these 
techniques and estimates the impact on complexity, with + meaning that the tech- 
nique improves the factor, - meaning it hurts that factor, and blank meaning it has 
no impact. No optimization in this figure helps more than one category. 


Virtual Memory 


... a system has been devised to make the core drum combination appear to the 
programmer as a single level store, the requisite transfers taking place 
automatically. 

Kilburn et al. [1962]


At any instant in time computers are running multiple processes, each with its 
own address space. (Processes are described in the next section.) It would be too 
expensive to dedicate a full address space worth of memory for each process, 
especially since many processes use only a small part of their address space. 
Technique                                           Hit time  Miss penalty  Miss rate  Hardware complexity  Comment

Larger block size                                             -             +          0                    Trivial; Pentium 4 L2 uses 128 bytes

Larger cache size                                   -                       +          1                    Widely used, especially for L2 caches

Higher associativity                                -                       +          1                    Widely used

Multilevel caches                                             +                        2                    Costly hardware; harder if L1 block size ≠ L2 block size; widely used

Read priority over writes                                     +                        1                    Widely used

Avoiding address translation during cache indexing  +                                  1                    Widely used





Figure C.17 Summary of basic cache optimizations showing impact on cache performance and complexity for
the techniques in this appendix. Generally a technique helps only one factor. + means that the technique improves
the factor, - means it hurts that factor, and blank means it has no impact. The complexity measure is subjective, with
0 being the easiest and 3 being a challenge.


Hence, there must be a means of sharing a smaller amount of physical memory 
among many processes. 

One way to do this, virtual memory, divides physical memory into blocks and 
allocates them to different processes. Inherent in such an approach must be a
protection scheme that restricts a process to the blocks belonging only to that
process. Most forms of virtual memory also reduce the time to start a program, since
not all code and data need be in physical memory before a program can begin. 

Although protection provided by virtual memory is essential for current com- 
puters, sharing is not the reason that virtual memory was invented. If a program 
became too large for physical memory, it was the programmer's job to make it fit. 
Programmers divided programs into pieces, then identified the pieces that were 
mutually exclusive, and loaded or unloaded these overlays under user program 
control during execution. The programmer ensured that the program never tried 
to access more physical main memory than was in the computer, and that the 
proper overlay was loaded at the proper time. As you can well imagine, this 
responsibility eroded programmer productivity. 

Virtual memory was invented to relieve programmers of this burden; it auto- 
matically manages the two levels of the memory hierarchy represented by main 
memory and secondary storage. Figure C.18 shows the mapping of virtual mem- 
ory to physical memory for a program with four pages. 

In addition to sharing protected memory space and automatically managing 
the memory hierarchy, virtual memory also simplifies loading the program for 
execution. Called relocation, this mechanism allows the same program to run in 
any location in physical memory. The program in Figure C.18 can be placed any- 
where in physical memory or disk just by changing the mapping between them. 
(Prior to the popularity of virtual memory, processors would include a relocation 
register just for that purpose.) An alternative to a hardware solution would be 
software that changed all addresses in a program each time it was run. 
Figure C.18 The logical program in its contiguous virtual address space is shown on
the left. It consists of four pages A, B, C, and D. The actual location of three of the blocks
is in physical main memory and the other is located on the disk.


Several general memory hierarchy ideas from Chapter 1 about caches are 
analogous to virtual memory, although many of the terms are different. Page or 
segment is used for block, and page fault or address fault is used for miss. With 
virtual memory, the processor produces virtual addresses that are translated by a 
combination of hardware and software to physical addresses, which access main 
memory. This process is called memory mapping or address translation. Today, 
the two memory hierarchy levels controlled by virtual memory are DRAMs and 
magnetic disks. Figure C.19 shows a typical range of memory hierarchy parame- 
ters for virtual memory. 

There are further differences between caches and virtual memory beyond 
those quantitative ones mentioned in Figure C.19: 


• Replacement on cache misses is primarily controlled by hardware, while
virtual memory replacement is primarily controlled by the operating system.
The longer miss penalty means it's more important to make a good decision,
so the operating system can be involved and take time deciding what to
replace.


• The size of the processor address determines the size of virtual memory, but
the cache size is independent of the processor address size.


• In addition to acting as the lower-level backing store for main memory in the
hierarchy, secondary storage is also used for the file system. In fact, the file
system occupies most of secondary storage. It is not normally in the address
space.


Virtual memory also encompasses several related techniques. Virtual memory 
systems can be categorized into two classes: those with fixed-size blocks, called 
Parameter          First-level cache            Virtual memory

Block (page) size  16-128 bytes                 4096-65,536 bytes

Hit time           1-3 clock cycles             100-200 clock cycles

Miss penalty       8-200 clock cycles           1,000,000-10,000,000 clock cycles
  (access time)    (6-160 clock cycles)         (800,000-8,000,000 clock cycles)
  (transfer time)  (2-40 clock cycles)          (200,000-2,000,000 clock cycles)

Miss rate          0.1-10%                      0.00001-0.001%

Address mapping    25-45 bit physical address   32-64 bit virtual address
                   to 14-20 bit cache address   to 25-45 bit physical address





Figure C.19 Typical ranges of parameters for caches and virtual memory. Virtual 
memory parameters represent increases of 10-1,000,000 times over cache parameters. 
Normally first-level caches contain at most 1 MB of data, while physical memory con- 
tains 256 MB to 1 TB. 
Figure C.20 Example of how paging and segmentation divide a program.


pages, and those with variable-size blocks, called segments. Pages are typically 
fixed at 4096 to 8192 bytes, while segment size varies. The largest segment
supported on any processor ranges from $2^{16}$ bytes up to $2^{32}$ bytes; the smallest
segment is 1 byte. Figure C.20 shows how the two approaches might divide code
and data. 

The decision to use paged virtual memory versus segmented virtual memory 
affects the processor. Paged addressing has a single fixed-size address divided 
into page number and offset within a page, analogous to cache addressing. A sin- 
gle address does not work for segmented addresses; the variable size of segments 
requires 1 word for a segment number and 1 word for an offset within a segment, 
for a total of 2 words. An unsegmented address space is simpler for the compiler. 

The pros and cons of these two approaches have been well documented in 
operating systems textbooks; Figure C.21 summarizes the arguments. Because of 
the replacement problem (the third line of the figure), few computers today use 
pure segmentation. Some computers use a hybrid approach, called paged 
segments, in which a segment is an integral number of pages. This simplifies 
replacement because memory need not be contiguous, and the full segments need 
not be in main memory. A more recent hybrid is for a computer to offer multiple 
page sizes, with the larger sizes being powers of 2 times the smallest page size. 
Figure C.21 Paging versus segmentation. Both can waste memory, depending on the
block size and how well the segments fit together in main memory. Programming lan- 
guages with unrestricted pointers require both the segment and the address to be 
passed. A hybrid approach, called paged segments, shoots for the best of both worlds: 
Segments are composed of pages, so replacing a block is easy, yet a segment may be 
treated as a logical unit. 


The IBM 405CR embedded processor, for example, allows 1 KB, 4 KB ($2^{2}$ × 1 KB),
16 KB ($2^{4}$ × 1 KB), 64 KB ($2^{6}$ × 1 KB), 256 KB ($2^{8}$ × 1 KB), 1024 KB
($2^{10}$ × 1 KB), and 4096 KB ($2^{12}$ × 1 KB) to act as a single page.


Four Memory Hierarchy Questions Revisited 


We are now ready to answer the four memory hierarchy questions for virtual 
memory. 


Q1: Where Can a Block Be Placed in Main Memory?


The miss penalty for virtual memory involves access to a rotating magnetic stor- 
age device and is therefore quite high. Given the choice of lower miss rates or a 
simpler placement algorithm, operating systems designers normally pick lower 
miss rates because of the exorbitant miss penalty. Thus, operating systems allow 
blocks to be placed anywhere in main memory. According to the terminology in 
Figure C.2 on page C-7, this strategy would be labeled fully associative. 


Q2: How Is a Block Found If It Is in Main Memory?


Both paging and segmentation rely on a data structure that is indexed by the page 
or segment number. This data structure contains the physical address of the 
block. For segmentation, the offset is added to the segment's physical address to 
obtain the final physical address. For paging, the offset is simply concatenated to 
this physical page address (see Figure C.22). 
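
The difference between the two lookups can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: 4 KB pages, Python dictionaries standing in for the page and segment tables, and hypothetical function names; it omits bounds and protection checks.

```python
PAGE_OFFSET_BITS = 12  # assumes 4 KB pages


def translate_paged(vaddr: int, page_table: dict) -> int:
    """Paging: look up the page number, then concatenate the unchanged offset."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]  # a missing key here would correspond to a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset


def translate_segmented(segment: int, offset: int, segment_table: dict) -> int:
    """Segmentation: add the offset to the segment's physical base address."""
    base = segment_table[segment]
    return base + offset


page_table = {0x12345: 0x00ABC}   # virtual page number -> physical page number
segment_table = {3: 0x40000}      # segment number -> physical base address

print(hex(translate_paged(0x12345678, page_table)))        # 0xabc678
print(hex(translate_segmented(3, 0x1234, segment_table)))  # 0x41234
```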
Figure C.22 The mapping of a virtual address to a physical address via a page table.


This data structure, containing the physical page addresses, usually takes the 
form of a page table. Indexed by the virtual page number, the size of the table is 
the number of pages in the virtual address space. Given a 32-bit virtual address, 
4 KB pages, and 4 bytes per Page Table Entry (PTE), the size of the page table 
would be $(2^{32}/2^{12}) \times 2^{2} = 2^{22}$ bytes, or 4 MB.
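
The same calculation can be written as a short Python sketch, which makes it easy to try other address widths or page sizes. The function name is an illustrative assumption.

```python
def page_table_bytes(virtual_address_bits: int, page_size: int, pte_bytes: int) -> int:
    """Size of a flat page table with one entry per virtual page."""
    num_virtual_pages = (1 << virtual_address_bits) // page_size
    return num_virtual_pages * pte_bytes


# 32-bit virtual address, 4 KB pages, 4-byte PTEs.
print(page_table_bytes(32, 4096, 4))  # 4194304 bytes = 4 MB
```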

To reduce the size of this data structure, some computers apply a hashing 
function to the virtual address. The hash allows the data structure to be the length 
of the number of physical pages in main memory. This number could be much 
smaller than the number of virtual pages. Such a structure is called an inverted 
page table. Using the previous example, a 512 MB physical memory would only 
need 1 MB (8 × 512 MB/4 KB) for an inverted page table; the extra 4 bytes per
page table entry are for the virtual address. The HP/Intel IA-64 covers both bases
by offering both traditional page tables and inverted page tables, leaving the
choice of mechanism to the operating system programmer. 
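
A minimal Python sketch of the idea follows. It assumes one entry per physical page frame and resolves hash collisions by chaining within a bucket; the chaining scheme and all names are illustrative assumptions, since the text does not specify how collisions are handled.

```python
class InvertedPageTable:
    """One entry per physical page frame, found by hashing the virtual page number."""

    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.vpn_of_frame = [None] * num_frames          # virtual page held by each frame
        self.buckets = [[] for _ in range(num_frames)]   # hash bucket -> candidate frames

    def map(self, vpn: int, frame: int) -> None:
        self.vpn_of_frame[frame] = vpn
        self.buckets[hash(vpn) % self.num_frames].append(frame)

    def lookup(self, vpn: int):
        """Return the frame holding vpn, or None (which would be a page fault)."""
        for frame in self.buckets[hash(vpn) % self.num_frames]:
            if self.vpn_of_frame[frame] == vpn:
                return frame
        return None


def inverted_table_bytes(physical_memory: int, page_size: int, entry_bytes: int) -> int:
    """Table size grows with physical memory, not with the virtual address space."""
    return (physical_memory // page_size) * entry_bytes


MB = 1024 * 1024
print(inverted_table_bytes(512 * MB, 4096, 8))  # 1048576 bytes = 1 MB

ipt = InvertedPageTable(num_frames=8)
ipt.map(vpn=0x12345, frame=5)
print(ipt.lookup(0x12345), ipt.lookup(0x54321))  # 5 None
```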

To reduce address translation time, computers use a cache dedicated to these 
address translations, called a translation lookaside buffer, or simply translation 
buffer, described in more detail shortly. 


Q3: Which Block Should Be Replaced on a Virtual Memory Miss? 


As mentioned earlier, the overriding operating system guideline is minimizing 
page faults. Consistent with this guideline, almost all operating systems try to 
replace the least-recently used (LRU) block because if the past predicts the 
future, that is the one less likely to be needed. 

To help the operating system estimate LRU, many processors provide a use 
bit or reference bit, which is logically set whenever a page is accessed. (To reduce 
work, it is actually set only on a translation buffer miss, which is described 
shortly.) The operating system periodically clears the use bits and later records 
them so it can determine which pages were touched during a particular time
period. By keeping track in this way, the operating system can select a page that 
is among the least-recently referenced. 
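
One possible way to turn use bits into a replacement decision is sketched below in Python. It is an assumption-laden illustration, not a description of any particular operating system: at the end of each period the use bits are recorded and cleared, and the victim is a page that has gone the most periods without a reference.

```python
class UseBitTracker:
    """Approximates LRU from periodically sampled use (reference) bits."""

    def __init__(self, pages):
        self.use_bit = {p: False for p in pages}
        self.idle_periods = {p: 0 for p in pages}

    def access(self, page) -> None:
        # In a real system the hardware sets this bit as a side effect of an access.
        self.use_bit[page] = True

    def end_of_period(self) -> None:
        # Operating system work: record each use bit, then clear it.
        for page, used in self.use_bit.items():
            self.idle_periods[page] = 0 if used else self.idle_periods[page] + 1
            self.use_bit[page] = False

    def victim(self):
        # A page with the most idle periods is among the least recently referenced.
        return max(self.idle_periods, key=self.idle_periods.get)


tracker = UseBitTracker(["A", "B", "C"])
tracker.access("A"); tracker.access("B")
tracker.end_of_period()
tracker.access("A")
tracker.end_of_period()
print(tracker.victim())  # C
```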


Q4: What Happens on a Write?


The level below main memory contains rotating magnetic disks that take millions 
of clock cycles to access. Because of the great discrepancy in access time, no one 
has yet built a virtual memory operating system that writes through main memory 
to disk on every store by the processor. (This remark should not be interpreted as 
an opportunity to become famous by being the first to build one!) Thus, the write 
strategy is always write back. 

Since the cost of an unnecessary access to the next-lower level is so high, vir- 
tual memory systems usually include a dirty bit. It allows blocks to be written to 
disk only if they have been altered since being read from the disk. 


Techniques for Fast Address Translation 


Page tables are usually so large that they are stored in main memory and are 
sometimes paged themselves. Paging means that every memory access logically 
takes at least twice as long, with one memory access to obtain the physical 
address and a second access to get the data. As mentioned in Chapter 5, we use 
locality to avoid the extra memory access. By keeping address translations in a 
special cache, a memory access rarely requires a second access to translate the 
data. This special address translation cache is referred to as a translation looka- 
side buffer (TLB), also called a translation buffer (TB). 

A TLB entry is like a cache entry where the tag holds portions of the virtual 
address and the data portion holds a physical page frame number, protection 
field, valid bit, and usually a use bit and dirty bit. To change the physical page 
frame number or protection of an entry in the page table, the operating system 
must make sure the old entry is not in the TLB; otherwise, the system won't 
behave properly. Note that this dirty bit means the corresponding page is dirty, 
not that the address translation in the TLB is dirty nor that a particular block in 
the data cache is dirty. The operating system resets these bits by changing the 
value in the page table and then invalidates the corresponding TLB entry. When 
the entry is reloaded from the page table, the TLB gets an accurate copy of the 
bits. 

Figure C.23 shows the Opteron data TLB organization, with each step of the 
translation labeled. This TLB uses fully associative placement; thus, the transla- 
tion begins (steps 1 and 2) by sending the virtual address to all tags. Of course, 
the tag must be marked valid to allow a match. At the same time, the type of 
memory access is checked for a violation (also in step 2) against protection infor- 
mation in the TLB. 

For reasons similar to those in the cache case, there is no need to include the 
12 bits of the page offset in the TLB. The matching tag sends the corresponding 
Figure C.23 Operation of the Opteron data TLB during address translation. The four
steps of a TLB hit are shown as circled numbers. This TLB has 40 entries. Section C.5
describes the various protection and access fields of an Opteron page table entry.


physical address through effectively a 40:1 multiplexor (step 3). The page offset 
is then combined with the physical page frame to form a full physical address 
(step 4). The address size is 40 bits. 
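
The four steps can be mimicked in software. The Python sketch below is a simplified model under stated assumptions: 4 KB pages (a 12-bit page offset), a single writable flag standing in for the protection field, and a sequential loop standing in for the parallel tag comparison done in hardware. The class and function names are hypothetical.

```python
from dataclasses import dataclass

PAGE_OFFSET_BITS = 12  # the page offset is not stored in the TLB


class ProtectionFault(Exception):
    pass


@dataclass
class TLBEntry:
    valid: bool
    vpn: int        # tag: virtual page number
    ppn: int        # data: physical page frame number
    writable: bool  # simplified stand-in for the protection field


def tlb_translate(tlb, vaddr: int, is_write: bool):
    """Return the physical address on a TLB hit, or None on a TLB miss."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    for entry in tlb:                       # steps 1 and 2: compare against all tags
        if entry.valid and entry.vpn == vpn:
            if is_write and not entry.writable:
                raise ProtectionFault(hex(vaddr))   # step 2: protection check
            # steps 3 and 4: select the frame and combine it with the page offset
            return (entry.ppn << PAGE_OFFSET_BITS) | offset
    return None


tlb = [TLBEntry(valid=True, vpn=0x7F123, ppn=0x00456, writable=False)]
print(hex(tlb_translate(tlb, 0x7F123ABC, is_write=False)))  # 0x456abc
print(tlb_translate(tlb, 0x10000000, is_write=False))       # None (TLB miss)
```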

Address translation can easily be on the critical path determining the clock 
cycle of the processor, so the Opteron uses virtually addressed, physically tagged 
L1 caches.


Selecting a Page Size 


The most obvious architectural parameter is the page size. Choosing the page size is a
question of balancing forces that favor a larger page size versus those favoring a 
smaller size. The following favor a larger size: 


• The size of the page table is inversely proportional to the page size; memory
(or other resources used for the memory map) can therefore be saved by making
the pages bigger.

• As mentioned in Section C.3, a larger page size can allow larger caches with
fast cache hit times.


• Transferring larger pages to or from secondary storage, possibly over a
network, is more efficient than transferring smaller pages.


• The number of TLB entries is restricted, so a larger page size means that
more memory can be mapped efficiently, thereby reducing the number of
TLB misses.


It is for this final reason that recent microprocessors have decided to support mul- 
tiple page sizes; for some programs, TLB misses can be as significant on CPI as 
the cache misses. 
The main motivation for a smaller page size is conserving storage. A small
page size will result in less wasted storage when a contiguous region of virtual 
memory is not equal in size to a multiple of the page size. The term for this 
unused memory in a page is internal fragmentation. Assuming that each process 
has three primary segments (text, heap, and stack), the average wasted storage 
per process will be 1.5 times the page size. This amount is negligible for
computers with hundreds of megabytes of memory and page sizes of 4 KB to 8 KB. Of
course, when the page sizes become very large (more than 32 KB), storage (both 
main and secondary) could be wasted, as well as I/O bandwidth. A final concern 
is process start-up time; many processes are small, so a large page size would 
lengthen the time to invoke a process. 
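
The internal fragmentation estimate follows from assuming that the last page of each segment is, on average, half empty, so three segments waste about one and a half pages. The Python sketch below illustrates this; the function names and the example sizes are assumptions.

```python
def wasted_bytes(region_size: int, page_size: int) -> int:
    """Unused bytes in the last page of a contiguous region."""
    return (-region_size) % page_size


def expected_waste_per_process(page_size: int, num_segments: int = 3) -> float:
    """Average waste if each segment's last page is half empty on average."""
    return num_segments * page_size / 2


print(wasted_bytes(10_000, 4096))           # 2288: three pages hold 12,288 bytes
print(expected_waste_per_process(4096))     # 6144.0 bytes, 1.5 pages of 4 KB
print(expected_waste_per_process(65_536))   # 98304.0 bytes with 64 KB pages
```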


Summary of Virtual Memory and Caches 


With virtual memory, TLBs, first-level caches, and second-level caches all map- 
ping portions of the virtual and physical address space, it can get confusing what 
bits go where. Figure C.24 gives a hypothetical example going from a 64-bit
virtual address to a 41-bit physical address with two levels of cache. This L1 cache
is virtually indexed, physically tagged since both the cache size and the page size 
are 8 KB. The L2 cache is 4 MB. The block size for both is 64 bytes. 

First, the 64-bit virtual address is logically divided into a virtual page number
and page offset. The former is sent to the TLB to be translated into a physical
address, and the high-order bits of the latter are sent to the L1 cache to act as an
index. If the TLB match is a hit, then the physical page number is sent to the L1
cache tag to check for a match. If it matches, it's an L1 cache hit. The block offset
then selects the word for the processor.

If the L1 cache check results in a miss, the physical address is then used to try
the L2 cache. The middle portion of the physical address is used as an index to 
the 4 MB L2 cache. The resulting L2 cache tag is compared to the upper part of 
the physical address to check for a match. If it matches, we have an L2 cache hit, 
and the data are sent to the processor, which uses the block offset to select the 
desired word. On an L2 miss, the physical address is then used to get the block 
from memory. 
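
To see which bits go where, the field widths can be derived from the sizes given for this hypothetical hierarchy. The Python sketch below assumes direct-mapped structures and a 256-entry TLB, as stated in the caption of Figure C.24; the function and key names are illustrative.

```python
def log2_exact(n: int) -> int:
    """log2 of a power of two; rejects anything else."""
    if n < 1 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def field_widths(va_bits=64, pa_bits=41, page_size=8 * 1024, tlb_entries=256,
                 l1_size=8 * 1024, l2_size=4 * 1024 * 1024, block_size=64):
    """Bit widths of each address field, assuming direct-mapped TLB, L1, and L2."""
    page_offset = log2_exact(page_size)
    block_offset = log2_exact(block_size)
    vpn = va_bits - page_offset
    tlb_index = log2_exact(tlb_entries)
    l1_index = log2_exact(l1_size // block_size)
    l2_index = log2_exact(l2_size // block_size)
    return {
        "page_offset": page_offset,
        "virtual_page_number": vpn,
        "tlb_index": tlb_index,
        "tlb_tag": vpn - tlb_index,
        "block_offset": block_offset,
        "l1_index": l1_index,
        "l1_tag": pa_bits - l1_index - block_offset,
        "l2_index": l2_index,
        "l2_tag": pa_bits - l2_index - block_offset,
    }


for name, bits in field_widths().items():
    print(f"{name:20s} {bits:2d} bits")
```

Because the L1 index and block offset together occupy 13 bits, exactly the page offset of an 8 KB page, the L1 cache can be indexed before translation completes.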

Although this is a simple example, the major difference between this drawing 
and a real cache is replication. First, there is only one L1 cache. When there are
two L1 caches, the top half of the diagram is duplicated. Note this would lead to
two TLBs, which is typical. Hence, one cache and TLB is for instructions, driven 
from the PC, and one cache and TLB is for data, driven from the effective 
address. 

The second simplification is that all the caches and TLBs are direct mapped. 
If any were n-way set associative, then we would replicate each set of tag mem- 
ory, comparators, and data memory n times and connect data memories with an 
n:1 multiplexor to select a hit. Of course, if the total cache size remained the
same, the cache index would also shrink by $\log_2 n$ bits according to the formula in
Figure C.7 on page C-21. 
Figure C.24 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache
access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and
the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address 
is 41 bits. The primary difference between this simple figure and a real cache is replication of pieces of this figure. 


C.5 Protection and Examples of Virtual Memory


The invention of multiprogramming, where a computer would be shared by 
several programs running concurrently, led to new demands for protection and 
sharing among programs. These demands are closely tied to virtual memory in 
computers today, and so we cover the topic here along with two examples of vir- 
tual memory. 

Multiprogramming leads to the concept of a process. Metaphorically, a pro- 
cess is a program's breathing air and living space—that is, a running program 
plus any state needed to continue running it. Time-sharing is a variation of
multiprogramming that shares the processor and memory with several interactive users
at the same time, giving the illusion that all users have their own computers. 
Thus, at any instant it must be possible to switch from one process to another. 
This exchange is called a process switch or context switch. 

A process must operate correctly whether it executes continuously from start 
to finish, or it is interrupted repeatedly and switched with other processes. The 
responsibility for maintaining correct process behavior is shared by designers of 
the computer and the operating system. The computer designer must ensure that 
the processor portion of the process state can be saved and restored. The operat- 
ing system designer must guarantee that processes do not interfere with each oth- 
ers' computations. 

The safest way to protect the state of one process from another would be to 
copy the current information to disk. However, a process switch would then take 
seconds—far too long for a time-sharing environment. 

This problem is solved by operating systems partitioning main memory so 
that several different processes have their state in memory at the same time. This 
division means that the operating system designer needs help from the computer 
designer to provide protection so that one process cannot modify another. 
Besides protection, the computers also provide for sharing of code and data 
between processes, to allow communication between processes or to save mem- 
ory by reducing the number of copies of identical information. 


Protecting Processes 


Processes can be protected from one another by having their own page tables, 
each pointing to distinct pages of memory. Obviously, user programs must be 
prevented from modifying their page tables or protection would be circumvented. 

Protection can be escalated, depending on the apprehension of the computer 
designer or the purchaser. Rings added to the processor protection structure 
expand memory access protection from two levels (user and kernel) to many 
more. Like a military classification system of top secret, secret, confidential, and 
unclassified, concentric rings of security levels allow the most trusted to access 
anything, the second most trusted to access everything except the innermost 
level, and so on. The "civilian" programs are the least trusted and, hence, have 
the most limited range of accesses. There may also be restrictions on what pieces 
of memory can contain code—execute protection—and even on the entrance 
point between the levels. The Intel 80x86 protection structure, which uses rings, 
is described later in this section. It is not clear whether rings are an improvement 
in practice over the simple system of user and kernel modes. 

As the designer's apprehension escalates to trepidation, these simple rings 
may not suffice. Restricting the freedom given a program in the inner sanctum 
requires a new classification system. Instead of a military model, the analogy of 
this system is to keys and locks: A program can't unlock access to the data unless 
it has the key. For these keys, or capabilities, to be useful, the hardware and
operating system must be able to explicitly pass them from one program to another
without allowing a program itself to forge them. Such checking requires a great 
deal of hardware support if time for checking keys is to be kept low. 

The 80x86 architecture has tried several of these alternatives over the years. 
Since backwards compatibility is one of the guidelines of this architecture, the 
most recent versions of the architecture include all of its experiments in virtual 
memory. We'll go over two of the options here: first, the older segmented address 
space and then the newer flat, 64-bit address space. 


A Segmented Virtual Memory Example: 
Protection in the Intel Pentium 


The second system is the most dangerous system a man ever designs.... The 
general tendency is to over-design the second system, using all the ideas and frills 
that were cautiously sidetracked on the first one. 


F. P. Brooks, Jr.
The Mythical Man-Month (1975) 


The original 8086 used segments for addressing, yet it provided nothing for vir- 
tual memory or for protection. Segments had base registers but no bound regis- 
ters and no access checks, and before a segment register could be loaded the 
corresponding segment had to be in physical memory. Intel's dedication to virtual 
memory and protection is evident in the successors to the 8086, with a few fields 
extended to support larger addresses. This protection scheme is elaborate, with 
many details carefully designed to try to avoid security loopholes. We'll refer to it 
as IA-32. The next few pages highlight a few of the Intel safeguards; if you find 
the reading difficult, imagine the difficulty of implementing them! 

The first enhancement is to double the traditional two-level protection model: 
the IA-32 has four levels of protection. The innermost level (0) corresponds to the 
traditional kernel mode, and the outermost level (3) is the least privileged mode. 
The IA-32 has separate stacks for each level to avoid security breaches between 
the levels. There are also data structures analogous to traditional page tables that 
contain the physical addresses for segments, as well as a list of checks to be made 
on translated addresses. 

The Intel designers did not stop there. The IA-32 divides the address space, 
allowing both the operating system and the user access to the full space. The IA- 
32 user can call an operating system routine in this space and even pass 
parameters to it while retaining full protection. This safe call is not a trivial 
action, since the stack for the operating system is different from the user's stack. 
Moreover, the IA-32 allows the operating system to maintain the protection level 
of the called routine for the parameters that are passed to it. This potential loop- 
hole in protection is prevented by not allowing the user process to ask the operat- 
ing system to access something indirectly that it would not have been able to 
access itself. (Such security loopholes are called Trojan horses.) 
The Intel designers were guided by the principle of trusting the operating
system as little as possible, while supporting sharing and protection. As an example of
the use of such protected sharing, suppose a payroll program writes checks and also 
updates the year-to-date information on total salary and benefits payments. Thus, 
we want to give the program the ability to read the salary and year-to-date informa- 
tion, and modify the year-to-date information but not the salary. We will see the 
mechanism to support such features shortly. In the rest of this subsection, we will 
look at the big picture of the IA-32 protection and examine its motivation. 


Adding Bounds Checking and Memory Mapping 


The first step in enhancing the Intel processor was getting the segmented address- 
ing to check bounds as well as supply a base. Rather than a base address, the seg- 
ment registers in the IA-32 contain an index to a virtual memory data structure 
called a descriptor table. Descriptor tables play the role of traditional page tables. 
On the IA-32 the equivalent of a page table entry is a segment descriptor. It con- 
tains fields found in PTEs: 


• Present bit: Equivalent to the PTE valid bit, used to indicate this is a valid
translation


• Base field: Equivalent to a page frame address, containing the physical
address of the first byte of the segment


• Access bit: Like the reference bit or use bit in some architectures that is
helpful for replacement algorithms


• Attributes field: Specifies the valid operations and protection levels for
operations that use this segment


There is also a limit field, not found in paged systems, which establishes the 
upper bound of valid offsets for this segment. Figure C.25 shows examples of IA- 
32 segment descriptors. 

IA-32 provides an optional paging system in addition to this segmented 
addressing. The upper portion of the 32-bit address selects the segment 
descriptor, and the middle portion is an index into the page table selected by the 
descriptor. We describe below the protection system that does not rely on paging. 


Adding Sharing and Protection 


To provide for protected sharing, half of the address space is shared by all pro- 
cesses and half is unique to each process, called global address space and local 
address space, respectively. Each half is given a descriptor table with the appro- 
priate name. A descriptor pointing to a shared segment is placed in the global 
descriptor table, while a descriptor for a private segment is placed in the local 
descriptor table. 

A program loads an IA-32 segment register with an index to the table and a 
bit saying which table it desires. The operation is checked according to the 
Figure C.25 The IA-32 segment descriptors are distinguished by bits in the
attributes field. Base, limit, present, readable, and writable are all self-explanatory. D
gives the default addressing size of the instructions: 16 bits or 32 bits. G gives the
granularity of the segment limit: 0 means in bytes and 1 means in 4 KB pages. G is set to 1
when paging is turned on to set the size of the page tables. DPL means descriptor
privilege level; this is checked against the code privilege level to see if the access will be
allowed. Conforming says the code takes on the privilege level of the code being called
rather than the privilege level of the caller; it is used for library routines. The
expand-down field flips the check to let the base field be the high-water mark and the limit field
be the low-water mark. As you might expect, this is used for stack segments that grow
down. Word count controls the number of words copied from the current stack to the
new stack on a call gate. The other two fields of the call gate descriptor, destination
selector and destination offset, select the descriptor of the destination of the call and the
offset into it, respectively. There are many more than these three segment descriptors
in the IA-32 protection model.


attributes in the descriptor, the physical address being formed by adding the off- 
set in the processor to the base in the descriptor, provided the offset is less than 
the limit field. Every segment descriptor has a separate 2-bit field to give the legal 
access level of this segment. A violation occurs only if the program tries to use a 
segment with a lower protection level in the segment descriptor. 
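
The base-plus-offset check described above can be sketched in a few lines. This
is only an illustration under simplifying assumptions: the class and function
names are hypothetical, the limit is taken to be in bytes (G = 0), and only the
present, limit, and writable checks are modeled, not the full IA-32 rule set.

```python
from dataclasses import dataclass


class SegmentFault(Exception):
    """Raised when an access fails one of the descriptor checks."""


@dataclass
class SegmentDescriptor:
    # Hypothetical, simplified view of the fields named in the text.
    base: int        # physical address of the first byte of the segment
    limit: int       # upper bound on valid offsets, assumed to be in bytes
    present: bool    # plays the role of a PTE valid bit
    writable: bool   # one of the attribute bits in Figure C.25


def translate(desc: SegmentDescriptor, offset: int, is_write: bool = False) -> int:
    """Form a physical address from a descriptor and a processor-supplied offset."""
    if not desc.present:
        raise SegmentFault("segment not present")
    if offset >= desc.limit:
        # The text permits the access only when the offset is less than the limit.
        raise SegmentFault("offset outside segment limit")
    if is_write and not desc.writable:
        raise SegmentFault("write to a segment whose writable field is clear")
    return desc.base + offset
```

For example, with `SegmentDescriptor(base=0x10000, limit=0x1000, present=True,
writable=False)`, a read at offset `0x20` yields physical address `0x10020`,
while a write or an offset of `0x1000` raises `SegmentFault`.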

We can now show how to invoke the payroll program mentioned above to 
update the year-to-date information without allowing it to update salaries. The 
program could be given a descriptor to the information that has the writable field 
clear, meaning it can read but not write the data. A trusted program can then be 
supplied that will only write the year-to-date information. It is given a descriptor 
with the writable field set (Figure C.25). The payroll program invokes the trusted 
code using a code segment descriptor with the conforming field set. This setting 
means the called program takes on the privilege level of the code being called
rather than the privilege level of the caller. Hence, the payroll program can read 
the salaries and call a trusted program to update the year-to-date totals, yet the 
payroll program cannot modify the salaries. If a Trojan horse exists in this sys- 
tem, to be effective it must be located in the trusted code whose only job is to 
update the year-to-date information. The argument for this style of protection is 
that limiting the scope of the vulnerability enhances security. 


Adding Safe Calls from User to OS Gates and Inheriting Protection 
Level for Parameters 


Allowing the user to jump into the operating system is a bold step. How, then, can 
a hardware designer increase the chances of a safe system without trusting the 
operating system or any other piece of code? The IA-32 approach is to restrict 
where the user can enter a piece of code, to safely place parameters on the proper 
stack, and to make sure the user parameters don't get the protection level of the 
called code. 

To restrict entry into others’ code, the IA-32 provides a special segment 
descriptor, or call gate, identified by a bit in the attributes field. Unlike other 
descriptors, call gates are full physical addresses of an object in memory; the off- 
set supplied by the processor is ignored. As stated above, their purpose is to pre- 
vent the user from randomly jumping anywhere into a protected or more 
privileged code segment. In our programming example, this means the only place 
the payroll program can invoke the trusted code is at the proper boundary. This 
restriction is needed to make conforming segments work as intended. 

What happens if caller and callee are "mutually suspicious," so that neither 
trusts the other? The solution is found in the word count field in the bottom 
descriptor in Figure C.25. When a call instruction invokes a call gate descriptor, 
the descriptor copies the number of words specified in the descriptor from the 
local stack onto the stack corresponding to the level of this segment. This copy- 
ing allows the user to pass parameters by first pushing them onto the local stack. 
The hardware then safely transfers them onto the correct stack. A return from a 
call gate will pop the parameters off both stacks and copy any return values to the 
proper stack. Note that this model is incompatible with the current practice of 
passing parameters in registers. 

This scheme still leaves open the potential loophole of having the operating 
system use the user's address, passed as parameters, with the operating system's 
security level, instead of with the user's level. The IA-32 solves this problem by 
dedicating 2 bits in every processor segment register to the requested protection 
level. When an operating system routine is invoked, it can execute an instruction 
that sets this 2-bit field in all address parameters with the protection level of the 
user that called the routine. Thus, when these address parameters are loaded into 
the segment registers, they will set the requested protection level to the proper 
value. The IA-32 hardware then uses the requested protection level to prevent any 
foolishness: No segment can be accessed from the system routine using those 
parameters if it has a more privileged protection level than requested. 
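
A minimal sketch of the requested-protection-level check follows. It assumes
the usual IA-32 numbering, in which level 0 is the most privileged and level 3
the least; the function names are hypothetical, and the rule shown is a
simplification for data segments only.

```python
def effective_level(current_level: int, requested_level: int) -> int:
    """Return the weaker (numerically larger) of the two protection levels."""
    return max(current_level, requested_level)


def may_access(descriptor_level: int, current_level: int, requested_level: int) -> bool:
    """Allow the access only if the segment is no more privileged than the
    effective level of the code and the pointer it is using."""
    return descriptor_level >= effective_level(current_level, requested_level)
```

Under these assumptions, an operating system routine running at level 0 that
uses a pointer stamped with the caller's level 3 is refused access to a level 0
segment (`may_access(0, 0, 3)` is `False`), although the same routine could
reach that segment through its own pointers (`may_access(0, 0, 0)` is `True`).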


A Paged Virtual Memory Example: 
The 64-Bit Opteron Memory Management 


AMD engineers found few uses of the elaborate protection model described 
above. The popular model is a flat, 32-bit address space, introduced by the 
80386, which sets all the base values of the segment registers to zero. Hence, 
AMD dispensed with the multiple segments in the 64-bit mode. It assumes that 
the segment base is zero and ignores the limit field. The page sizes are 4 KB, 
2 MB, and 4 MB. 

The 64-bit virtual address of the AMD64 architecture is mapped onto 52-bit 
physical addresses, although implementations can implement fewer bits to sim- 
plify hardware. The Opteron, for example, uses 48-bit virtual addresses and 40- 
bit physical addresses. AMD64 requires that the upper 16 bits of the virtual 
address be just the sign extension of the lower 48 bits, which it calls canonical 
form. 
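
The canonical-form requirement can be expressed as a short check. The sketch
below assumes the 48-bit implemented virtual address quoted for the Opteron;
the function names are illustrative and not taken from any AMD document.

```python
VA_BITS = 48  # implemented virtual address bits on the Opteron, per the text
MASK_64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63 down to va_bits are all copies of bit va_bits - 1."""
    addr &= MASK_64
    upper = addr >> (va_bits - 1)             # the sign bit and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one bits when va_bits is 48
    return upper == 0 or upper == all_ones


def sign_extend(addr: int, va_bits: int = VA_BITS) -> int:
    """Build the canonical 64-bit form of a va_bits-wide address."""
    low = addr & ((1 << va_bits) - 1)
    if low >> (va_bits - 1):                  # sign bit set: fill the upper bits with ones
        low |= MASK_64 ^ ((1 << va_bits) - 1)
    return low
```

With 48 implemented bits, `0x00007FFFFFFFFFFF` and `0xFFFF800000000000` are
canonical, while `0x0000800000000000` is not.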

The size of page tables for the 64-bit address space is alarming. Hence, 
AMD64 uses a multilevel hierarchical page table to map the address space to 
keep the size reasonable. The number of levels depends on the size of the virtual 
address space. Figure C.26 shows the four-level translation of the 48-bit virtual 
addresses of the Opteron. 

The offsets for each of these page tables come from four 9-bit fields. Address 
translation starts with adding the first offset to the page-map level 4 base register 
and then reading memory from this location to get the base of the next-level page 
table. The next address offset is in turn added to this newly fetched address, and 
memory is accessed again to determine the base of the third page table. It hap- 
pens again in the same fashion. The last address field is added to this final base 
address, and memory is read using this sum to (finally) get the physical address 
of the page being referenced. This address is concatenated with the 12-bit page 
offset to get the full physical address. Note that each page table in the
Opteron architecture fits within a single 4 KB page.
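
The walk just described can be sketched as a loop over the four 9-bit fields.
This is a toy model under stated assumptions: physical memory is a Python
dictionary from address to 64-bit entry, each index is scaled by the 8-byte
entry size before being added to the table base, bit 0 of an entry is taken as
the presence bit, and only 4 KB pages are handled. All names are hypothetical.

```python
PAGE_OFFSET_BITS = 12
LEVEL_BITS = 9
LEVELS = 4
ENTRY_BYTES = 8
PRESENT = 1                                        # assumed position of the presence bit
FRAME_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)    # bits 51..12 hold the next base


def split(va: int):
    """Split a 48-bit virtual address into four table indices and a page offset."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    fields = []
    for level in range(LEVELS):
        shift = PAGE_OFFSET_BITS + LEVEL_BITS * (LEVELS - 1 - level)
        fields.append((va >> shift) & ((1 << LEVEL_BITS) - 1))
    return fields, offset


def walk(memory: dict, pml4_base: int, va: int) -> int:
    """Follow the four page table levels and return the physical address."""
    fields, offset = split(va)
    base = pml4_base
    for index in fields:
        entry = memory.get(base + index * ENTRY_BYTES, 0)
        if not entry & PRESENT:
            raise LookupError("page fault: entry not present")
        base = entry & FRAME_MASK                  # base of next table, or of the page
    return base | offset                           # concatenate with the 12-bit offset
```

Each iteration costs one memory read, which is why a TLB miss that goes through
all four levels is expensive.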

The Opteron uses a 64-bit entry in each of these page tables. The first 12 bits
are reserved for future use, the next 40 bits contain the physical page frame
number, and the last 12 bits give the protection and use information. Although
the fields vary some between the page table levels, here are the basic ones:


• Presence: Says that page is present in memory.

• Read/write: Says whether page is read-only or read-write.

• User/supervisor: Says whether a user can access the page or if it is limited
to upper three privilege levels.

• Dirty: Says if page has been modified.

• Accessed: Says if page has been read or written since the bit was last
cleared.

• Page size: Says whether last level is for 4 KB pages or 4 MB pages; if it's
the latter, then the Opteron only uses three instead of four levels of pages.

• No execute: Not found in the 80386 protection scheme, this bit was added to
prevent code from executing in some pages.

• Page level cache disable: Says whether the page can be cached or not.

• Page level write through: Says whether the page allows write back or write
through for data caches.

Figure C.26 The mapping of an Opteron virtual address. The Opteron virtual
memory implementation with four page table levels supports an effective physical
address size of 40 bits. Each page table has 512 entries, so each level field is
9 bits wide. The AMD64 architecture document allows the virtual address size to
grow from the current 48 bits to 64 bits, and the physical address size to grow
from the current 40 bits to 52 bits.
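
To make the entry fields concrete, here is a sketch that unpacks a 64-bit entry
into the fields listed above. The bit positions are an assumption drawn from
the conventional AMD64 layout rather than from this text, and the dictionary
keys and function name are illustrative.

```python
# Assumed bit positions (conventional AMD64 layout, not stated in the text).
FLAG_BITS = {
    "present": 0,
    "read_write": 1,
    "user_supervisor": 2,
    "page_level_write_through": 3,
    "page_level_cache_disable": 4,
    "accessed": 5,
    "dirty": 6,
    "page_size": 7,
    "no_execute": 63,
}
FRAME_SHIFT = 12             # the low 12 bits hold protection and use information
FRAME_BITS = 40              # 52-bit physical address minus the 12-bit page offset


def decode_entry(entry: int) -> dict:
    """Unpack one 64-bit page table entry into named fields."""
    fields = {name: bool((entry >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["frame_number"] = (entry >> FRAME_SHIFT) & ((1 << FRAME_BITS) - 1)
    return fields
```

For instance, `decode_entry((0x12345 << 12) | 0b11)` reports a present,
writable page in frame `0x12345` with the dirty and accessed bits clear.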


Since the Opteron normally goes through four levels of tables on a TLB miss, 
there are three potential places to check protection restrictions. The Opteron 
obeys only the bottom-level Page Table Entry, checking the others only to be sure 
the valid bit is set. 

As the entry is 8 bytes long, each page table has 512 entries, and the Opteron
has 4 KB pages, the page tables are exactly one page long. Each of the four
level fields is 9 bits long and the page offset is 12 bits. This derivation
leaves 64 - (4 x 9 + 12), or 16 bits, to be sign extended to ensure canonical
addresses.

Although we have explained translation of legal addresses, what prevents the
user from creating illegal address translations and getting into mischief? The
page tables themselves are protected from being written by user programs. Thus,
the user can try any virtual address, but by controlling the page table entries
the operating system controls what physical memory is accessed. Sharing of
memory between processes is accomplished by having a page table entry in each
address space point to the same physical memory page.

Parameter            Description
Block size           1 PTE (8 bytes)
L1 hit time          1 clock cycle
L2 hit time          7 clock cycles
L1 TLB size          Same for instruction and data TLBs: 40 PTEs per TLB, with
                     32 for 4 KB pages and 8 for 2 MB or 4 MB pages
L2 TLB size          Same for instruction and data TLBs: 512 PTEs of 4 KB pages
Block selection      LRU
Write strategy       (Not applicable)
L1 block placement   Fully associative
L2 block placement   4-way set associative

Figure C.27 Memory hierarchy parameters of the Opteron L1 and L2 instruction and
data TLBs.

The Opteron employs four TLBs to reduce address translation time, two for 
instruction accesses and two for data accesses. Like multilevel caches, the 
Opteron reduces TLB misses by having two larger L2 TLBs: one for instructions 
and one for data. Figure C.27 describes the data TLB. 
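
As a rough illustration of what the TLB sizes in Figure C.27 buy, the sketch
below computes how much memory each TLB can map at once using 4 KB pages, a
quantity sometimes called TLB reach. The term and the function name are not
from the text; the entry counts are those listed in the figure.

```python
KB = 1024


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory that can be mapped without a TLB miss."""
    return entries * page_bytes


l1_reach = tlb_reach(32, 4 * KB)     # L1 TLB: 32 entries for 4 KB pages
l2_reach = tlb_reach(512, 4 * KB)    # L2 TLB: 512 entries for 4 KB pages
```

Under these figures the L1 TLB covers 128 KB and the L2 TLB covers 2 MB, which
suggests why the larger pages are attractive for programs with big working
sets.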


Summary: Protection on the 32-Bit Intel Pentium vs. the 
64-Bit AMD Opteron 


Memory management in the Opteron is typical of most desktop or server comput- 
ers today, relying on page-level address translation and correct operation of the 
operating system to provide safety to multiple processes sharing the computer. 
Although presented as alternatives, Intel has followed AMD's lead and embraced 
the AMD64 architecture. Hence, both AMD and Intel support the 64-bit exten- 
sion of 80x86, yet, for compatibility reasons, both support the elaborate seg- 
mented protection scheme. 

If the segmented protection model looks harder to build than the AMD64 
model, that's because it is. This effort must be especially frustrating for the engi- 
neers, since few customers use the elaborate protection mechanism. In addition, 
the fact that the protection model is a mismatch to the simple paging protection 
of UNIX-like systems means it will be used only by someone writing an operat- 
ing system especially for this computer, which hasn't happened yet. 
Fallacies and Pitfalls


Even a review of memory hierarchy has fallacies and pitfalls!

Pitfall: Too small an address space.


Just five years after DEC and Carnegie Mellon University collaborated to design 
the new PDP-11 computer family, it was apparent that their creation had a fatal 
flaw. An architecture announced by IBM six years before the PDP-11 was still 
thriving, with minor modifications, 25 years later. And the DEC VAX, criticized 
for including unnecessary functions, sold millions of units after the PDP-11 went 
out of production. Why? 

The fatal flaw of the PDP-11 was the size of its addresses (16 bits) as com- 
pared to the address sizes of the IBM 360 (24 to 31 bits) and the VAX (32 bits). 
Address size limits the program length, since the size of a program and the 
amount of data needed by the program must be less than $2^{\text{Address size}}$. The reason
the address size is so hard to change is that it determines the minimum width of 
anything that can contain an address: PC, register, memory word, and effective- 
address arithmetic. If there is no plan to expand the address from the start, then 
the chances of successfully changing address size are so slim that it normally 
means the end of that computer family. Bell and Strecker [1976] put it like this: 


There is only one mistake that can be made in computer design that is difficult to 
recover from—not having enough address bits for memory addressing and mem- 
ory management. The PDP-11 followed the unbroken tradition of nearly every 
known computer. [p. 2]


A partial list of successful computers that eventually starved to death for lack of 
address bits includes the PDP-8, PDP-10, PDP-11, Intel 8080, Intel 8086, Intel 
80186, Intel 80286, Motorola 6800, AMI 6502, Zilog Z80, CRAY-1, and CRAY 
X-MP. 

The venerable 80x86 line bears the distinction of having been extended twice, 
first to 32 bits with the Intel 80386 in 1985 and recently to 64 bits with the AMD 
Opteron. 
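
The address sizes quoted in this pitfall translate into hard ceilings on
program plus data size. The following sketch simply evaluates 2 raised to the
address size for the machines named in the text; the dictionary labels and the
function name are illustrative.

```python
# Address sizes in bits, as quoted in the text.
ADDRESS_BITS = {
    "PDP-11": 16,
    "IBM 360 (24-bit)": 24,
    "IBM 360 (31-bit)": 31,
    "VAX": 32,
}


def addressable_bytes(bits: int) -> int:
    """Upper bound on program plus data size for a given address width."""
    return 1 << bits


limits = {name: addressable_bytes(bits) for name, bits in ADDRESS_BITS.items()}
```

The PDP-11 entry works out to 65,536 bytes (64 KB), against 16 MB to 2 GB for
the IBM 360 and 4 GB for the VAX.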


Pitfall: Ignoring the impact of the operating system on the performance of the
memory hierarchy.


Figure C.28 shows the memory stall time due to the operating system spent on 
three large workloads. About 25% of the stall time is either spent in misses in the 
operating system or results from misses in the application programs because of 
interference with the operating system. 
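
The 25% figure can be checked against the rows of Figure C.28. The sketch below
assumes that the final column is the sum of the five OS-related time columns
that precede it (OS conflicts with applications, OS instruction misses, data
misses for migration, data misses in block operations, and the rest of OS
misses); the numbers are copied from the figure, and the names are
illustrative.

```python
# Per workload: five OS-related % time columns, then the reported total.
ROWS = {
    "Pmake":    (4.8, 10.9, 1.0, 6.2, 2.9, 25.8),
    "Multipgm": (3.4, 9.2, 4.2, 4.7, 3.4, 24.9),
    "Oracle":   (10.2, 10.6, 2.6, 0.6, 2.8, 26.8),
}


def os_related_total(row) -> float:
    """Sum the five OS-related columns, rounded to one decimal place."""
    return round(sum(row[:5]), 1)


totals = {name: os_related_total(row) for name, row in ROWS.items()}
```

For all three workloads the computed total matches the reported last column,
which is consistent with reading that column as the overall OS contribution to
stall time.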


Pitfall: Relying on the operating systems to change the page size over time.


The Alpha architects had an elaborate plan to grow the architecture over time by
growing its page size, even building it into the size of its virtual address.
When it came time to grow page sizes with later Alphas, the operating system
designers balked and the virtual memory system was revised to grow the address
space while maintaining the 8 KB page.

                                   % time
                                   Inherent     OS conflicts  OS           Data misses  Data misses  Rest of  OS misses and
           % in          % in      application  with          instruction  for          in block     OS       application
Workload   applications  OS        misses       applications  misses       migration    operations   misses   conflicts
Pmake      47%           53%       14.1%        4.8%          10.9%        1.0%         6.2%         2.9%     25.8%
Multipgm   53%           47%       21.6%        3.4%          9.2%         4.2%         4.7%         3.4%     24.9%
Oracle     73%           27%       25.7%        10.2%         10.6%        2.6%         0.6%         2.8%     26.8%

Figure C.28 Misses and time spent in misses for applications and operating
system. The operating system adds about 25% to the execution time of the
application. Each processor has a 64 KB instruction cache and a two-level data
cache with 64 KB in the first level and 256 KB in the second level; all caches
are direct mapped with 16-byte blocks. Collected on Silicon Graphics POWER
station 4D/340, a multiprocessor with four 33 MHz R3000 processors running three
application workloads under a UNIX System V. Pmake: a parallel compile of 56
files; Multipgm: the parallel numeric program MP3D running concurrently with
Pmake and a five-screen edit session; and Oracle: running a restricted version
of the TP-1 benchmark using the Oracle database. (Data from Torrellas, Gupta,
and Hennessy [1992].)

Architects of other computers noticed very high TLB miss rates, and so 
added multiple, larger page sizes to the TLB. The hope was that operating sys- 
tems programmers would allocate an object to the largest page that made sense, 
thereby preserving TLB entries. After a decade of trying, most operating systems 
use these "superpages" only for handpicked functions: mapping the display 
memory or other I/O devices, or using very large pages for the database code. 


Concluding Remarks 


The difficulty of building a memory system to keep pace with faster processors is 
underscored by the fact that the raw material for main memory is the same as that 
found in the cheapest computer. It is the principle of locality that helps us here— 
its soundness is demonstrated at all levels of the memory hierarchy in current 
computers, from disks to TLBs. 

However, the increasing relative latency to memory, taking hundreds of 
clock cycles in 2006, means that programmers and compiler writers must be 
aware of the parameters of the caches and TLBs if they want their programs to 
perform well. 
C.8 Historical Perspective and References


In Section K.6 on the companion CD we examine the history of caches, virtual 
memory, and virtual machines. IBM plays a prominent role in this history. Refer- 
ences for further reading are included. 
Subset of the Instructions in MIPS64


Instruction type/opcode          Instruction meaning

Data transfers                   Move data between registers and memory, or between the integer and
                                 FP or special registers; only memory address mode is 16-bit
                                 displacement + contents of a GPR
LB, LBU, SB                      Load byte, load byte unsigned, store byte (to/from integer registers)
LH, LHU, SH                      Load half word, load half word unsigned, store half word (to/from
                                 integer registers)
LW, LWU, SW                      Load word, load word unsigned, store word (to/from integer registers)
LD, SD                           Load double word, store double word (to/from integer registers)
L.S, L.D, S.S, S.D               Load SP float, load DP float, store SP float, store DP float
MFC0, MTC0                       Copy from/to GPR to/from a special register
MOV.S, MOV.D                     Copy one SP or DP FP register to another FP register
MFC1, MTC1                       Copy 32 bits from/to FP registers to/from integer registers

Arithmetic/logical               Operations on integer or logical data in GPRs; signed arithmetic
                                 trap on overflow
DADD, DADDI, DADDU, DADDIU       Add, add immediate (all immediates are 16 bits); signed and unsigned
DSUB, DSUBU                      Subtract; signed and unsigned
DMUL, DMULU, DDIV, DDIVU, MADD   Multiply and divide, signed and unsigned; multiply-add; all
                                 operations take and yield 64-bit values
AND, ANDI                        And, and immediate
OR, ORI, XOR, XORI               Or, or immediate, exclusive or, exclusive or immediate
LUI                              Load upper immediate; loads bits 32 to 47 of register with
                                 immediate, then sign-extends
DSLL, DSRL, DSRA, DSLLV,         Shifts: both immediate (DS__) and variable form (DS__V); shifts are
DSRLV, DSRAV                     shift left logical, right logical, right arithmetic
SLT, SLTI, SLTU, SLTIU           Set less than, set less than immediate; signed and unsigned

Control                          Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ                       Branch GPR equal/not equal to zero; 16-bit offset from PC + 4
BEQ, BNE                         Branch GPR equal/not equal; 16-bit offset from PC + 4
BC1T, BC1F                       Test comparison bit in the FP status register and branch; 16-bit
                                 offset from PC + 4
MOVN, MOVZ                       Copy GPR to another GPR if third GPR is negative, zero
J, JR                            Jumps: 26-bit offset from PC + 4 (J) or target in register (JR)
JAL, JALR                        Jump and link: save PC + 4 in R31, target is PC-relative (JAL) or a
                                 register (JALR)
TRAP                             Transfer to operating system at a vectored address
ERET                             Return to user code from an exception; restore user mode

Floating point                   FP operations on DP and SP formats
ADD.D, ADD.S, ADD.PS             Add DP, SP numbers, and pairs of SP numbers
SUB.D, SUB.S, SUB.PS             Subtract DP, SP numbers, and pairs of SP numbers
MUL.D, MUL.S, MUL.PS             Multiply DP, SP floating point, and pairs of SP numbers
MADD.D, MADD.S, MADD.PS          Multiply-add DP, SP numbers, and pairs of SP numbers
DIV.D, DIV.S, DIV.PS             Divide DP, SP floating point, and pairs of SP numbers
CVT._._                          Convert instructions: CVT.x.y converts from type x to type y, where
                                 x and y are L (64-bit integer), W (32-bit integer), D (DP), or
                                 S (SP). Both operands are FPRs.
C.__.D, C.__.S                   DP and SP compares: "__" = LT, GT, LE, GE, EQ, NE; sets bit in FP
                                 status register








COMPUTER ARCHITECTURE


A Quantitative Approach, Fourth Edition
John L. Hennessy and David A. Patterson


The multiprocessor is here and it can no longer be avoided. As we bid farewell
to single-core processors and move into the chip multiprocessing age, it is great
timing for a new edition of Hennessy and Patterson's classic. Few books have
had as significant an impact on the way their discipline is taught, and the current
edition will ensure its place at the top for some time to come.

Luiz André Barroso, Google Inc.


What do the following have in common: Beatles' tunes, HP calculators, chocolate
chip cookies, and Computer Architecture? They are all classics that have stood
the test of time.

Robert P. Colwell, Intel lead architect


The best-selling computer architecture book redefines the field with a new
edition updated to address the historic shift from single-core to multicore
processors. Using their acclaimed quantitative approach to computing design,
Hennessy and Patterson explore a variety of techniques for achieving
parallelism, the key to unlocking the power of multiple-processor
architectures. While refocusing on multiprocessor issues, the authors also cover
design factors beyond processor performance, including power, reliability,
availability, and dependability.


Sustaining the last century's improvements in cost and performance will require
continuing innovations, founded in large part on the fundamental techniques
covered in this classic text.


Fourth Edition Features

• Trademark Putting It All Together sections highlight the latest technology
from industry, including the Sun Niagara, AMD Opteron, and Pentium 4.

• Review appendices cover the basic and intermediate principles that the main
text relies upon.

• Reference appendices on the companion CD, some guest-authored by subject
experts, cover a range of topics including embedded systems, vector processors,
interconnection networks, and large-scale multiprocessors.

• Case Studies contributed by experts in industry and academia are accompanied
by increasingly complex exercises to explore the key concepts covered in each
chapter.




